LLG: Comprehensive Guided Decoding Infrastructure #265
|
@guoqingbao - there's a grammar playground in examples while this is draft and it spits GRAMMAR lines into the log when they are generated for validation. Even if the rest is ready we probably want to remove those before merge (if you decide to merge while i'm in transit for GTC). Current state works correctly even with XML generation for the 0.8 3.5 which makes it a tool-call monster :-).
vllm-rs-svc2 | 2026-03-15T15:53:52.644891Z INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2 | start: ( text | tool_call )+ eos
vllm-rs-svc2 | text: /(?s:.*)/
vllm-rs-svc2 | tool_call: <[151657]> "\n" tool_content <[151658]>
vllm-rs-svc2 | value_0_0: /[ -~]*?/
vllm-rs-svc2 | param_0_0: "<parameter=url>" "\n" value_0_0 "\n" "</parameter>" "\n"
vllm-rs-svc2 | value_0_1: /[ -~]*?/
vllm-rs-svc2 | param_0_1: "<parameter=proxy>" "\n" value_0_1 "\n" "</parameter>" "\n"
vllm-rs-svc2 | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 (param_0_1)? "</function>" "\n"
vllm-rs-svc2 | value_1_0: /[ -~]*?/
vllm-rs-svc2 | param_1_0: "<parameter=path>" "\n" value_1_0 "\n" "</parameter>" "\n"
vllm-rs-svc2 | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2 | value_2_0: /[ -~]*?/
vllm-rs-svc2 | param_2_0: "<parameter=path>" "\n" value_2_0 "\n" "</parameter>" "\n"
vllm-rs-svc2 | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2 | tool_3: "<function=get_current_time>" "\n" "</function>" "\n"
vllm-rs-svc2 | value_4_0: /[ -~]*?/
vllm-rs-svc2 | param_4_0: "<parameter=query>" "\n" value_4_0 "\n" "</parameter>" "\n"
vllm-rs-svc2 | value_4_1: /[ -~]*?/
vllm-rs-svc2 | param_4_1: "<parameter=searxng>" "\n" value_4_1 "\n" "</parameter>" "\n"
vllm-rs-svc2 | tool_4: "<function=web_search_searxng>" "\n" param_4_0 (param_4_1)? "</function>" "\n"
vllm-rs-svc2 | tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
vllm-rs-svc2 | eos: <[248046]>
Turns out the EOS thing was killing us with multimodals, but secondly on XML we have to have the pedantic literals for Re: EOS - i would like to convert |
|
The last two commits are in testing now, apologies for the branch depth will compress it all to one commit before merge - on the road for a bit so working remote and using GH as a pivot |
|
Are you able rebase on main? PR #262 has been merged. |
e789778 to 862b18b
|
Done sir. Will see how much I can get done on plane. |
|
@guoqingbao if you have a couple of minutes, mind looking at the last commit? Trying to be graceful with this and not break
|
|
Apologies for any compile issues, I can only run build tests on cuda GPU systems now due to the attentionrs change and haven't had time to dig into fixing that while at gtc (the netbook I have w me is igpu only and we don't support mkl yet) |
No hurry, this is an optional feature. |
|
Agreed, although having both openai and anthropic capabilities + user supplied chain of thought mechanisms opens the doors to some awesome appdev options both using vllmrs as a standalone bin and even more so as a lib. One example being running a preprocessing micromodel to define grammars for the big one actually doing the work in a single runtime, another being long context w small models on vram constrained systems. That said, it may be handy to cherry pick the eos commit while I'm at gtc and slower than usual to complete the PR - I'm betting that's why we see those run-on problems w newer qwens on older gear: they're grabbing EOS vs EOM which is also why we occasionally see the EOM rendered in their output as it's not being masked. Something about naive attention seems to allow them to omit like that but this should make both EOS and EOM work. I need to understand attention computation product better to prove that but the behavior is 1:1 with what happened here before I switched from the multimodal EOS single type to the EOM ID or now multi type |
90e39de to 443f552
|
@guoqingbao - some decent progress :-)
vllm-rs-svc2 | 2026-03-22T05:28:19.885268Z INFO vllm_rs::utils::guidance: GRAMMAR:
vllm-rs-svc2 | start: ( text | tool_call )+ eos
vllm-rs-svc2 | tool_content: tool_0 ? tool_1 ? tool_2 ? tool_3 ? tool_4
vllm-rs-svc2 | tool_0: "<function=fetch_url_via_curl>" "\n" param_0_0 (param_0_1)? "</function>" "\n"
vllm-rs-svc2 | tool_call: <[248058]> "\n" tool_content "\n" <[248059]> "\n"
vllm-rs-svc2 | tool_1: "<function=fs_cat>" "\n" param_1_0 "</function>" "\n"
vllm-rs-svc2 | tool_2: "<function=fs_ls>" "\n" param_2_0 "</function>" "\n"
vllm-rs-svc2 | tool_3: "<function=get_current_time>" "\n" "</function>" "\n"
vllm-rs-svc2 | tool_4: "<function=web_search_searxng>" "\n" param_4_0 (param_4_1)? "</function>" "\n"
vllm-rs-svc2 | param_0_0: "<parameter=url>" "\n" value_0_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_0_1: "<parameter=proxy>" "\n" value_0_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_1_0: "<parameter=path>" "\n" value_1_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_2_0: "<parameter=path>" "\n" value_2_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_0: "<parameter=query>" "\n" value_4_0_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | param_4_1: "<parameter=searxng>" "\n" value_4_1_string "\n" "</parameter>" "\n"
vllm-rs-svc2 | text: /(?s:.+?)/
vllm-rs-svc2 | value_0_0_string: %json {"type":"string","description":"The URL to scrape."}
vllm-rs-svc2 | value_0_1_string: %json {"type":"string","description":"The proxy URL in the format protocol://host:port."}
vllm-rs-svc2 | value_1_0_string: %json {"type":"string","description":"The path of the file to read"}
vllm-rs-svc2 | value_2_0_string: %json {"type":"string","description":"The path of the directory to list"}
vllm-rs-svc2 | value_4_0_string: %json {"type":"string","description":"The query to search for."}
vllm-rs-svc2 | value_4_1_string: %json {"type":"string","description":"Optional searxng URL overriding the env var"}
vllm-rs-svc2 | eos: ( <[248044]> | <[248046]> )
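The `tool_N` / `param_N_M` / `value_N_M_string` rules in the log above follow a regular pattern, so here is a minimal sketch of how rules of that shape could be emitted from a tool list. The `Tool` and `Param` types and the `xml_tool_rules` function are hypothetical stand-ins, not the PR's actual types, and every value is typed as a plain JSON string for brevity.

```rust
/// Hypothetical, simplified tool description (not the PR's actual types).
struct Param {
    name: &'static str,
    required: bool,
}

struct Tool {
    name: &'static str,
    params: Vec<Param>,
}

/// Emit `tool_content`, `tool_N`, `param_N_M` and `value_N_M_string` rules
/// in the same shape as the logged grammar. Typing every value as a plain
/// JSON string is an assumption of this sketch.
fn xml_tool_rules(tools: &[Tool]) -> String {
    let mut rules = Vec::new();
    let alternatives: Vec<String> = (0..tools.len()).map(|i| format!("tool_{i}")).collect();
    rules.push(format!("tool_content: {}", alternatives.join(" | ")));

    for (i, tool) in tools.iter().enumerate() {
        // Required parameters appear bare, optional ones as `(param)?`.
        let mut body = String::new();
        for (j, param) in tool.params.iter().enumerate() {
            if param.required {
                body.push_str(&format!("param_{i}_{j} "));
            } else {
                body.push_str(&format!("(param_{i}_{j})? "));
            }
        }
        rules.push(format!(
            "tool_{i}: \"<function={}>\" \"\\n\" {}\"</function>\" \"\\n\"",
            tool.name, body
        ));
        for (j, param) in tool.params.iter().enumerate() {
            rules.push(format!(
                "param_{i}_{j}: \"<parameter={}>\" \"\\n\" value_{i}_{j}_string \"\\n\" \"</parameter>\" \"\\n\"",
                param.name
            ));
            rules.push(format!("value_{i}_{j}_string: %json {{\"type\":\"string\"}}"));
        }
    }
    rules.join("\n")
}
```

Keeping the optional parameters as `(param)?` after the required ones fixes the parameter order, which matches the logged grammar; an empty tool list would produce an empty `tool_content` rule, so a real generator would need to skip the tool grammar in that case.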
|
|
All three variants of grammar-induced tool-calls are working for Q3.5/Next from 0.8->122, tested on the older Q3 0.6 too with excellent results (not the XML type obviously for that model):
Functionally-speaking this is a huge help for small models or non-flash attention models (V100s running the various 27/35B 3.5s or the 80B coder seem very happy about this). The LLG 1.7 bump helps but it still seems preferable to use the Will push cleanup code shortly and collapse the commit for review. |
|
@guoqingbao apologies i didn't take the draft tag off last night. I think this is at a good place for a review boundary. You use different clients and my biggest question about the logic of this state is "how do they respond to the multi-tool-call logic?" since this permits both JSON and XML models to put multiple calls in one Propose splitting-out the following task areas to future PRs:
In terms of this work: the grammar control seems most useful on small models with any attention mechanism, any models not using |
0b55df7 to 68554aa
I’ll take this on once the decoding cache mismatch issue is resolved. P.S. I’m also casually working on a bot project in Rust. I can loop you in for initial usage if you’re interested. |
|
Thanks, this is a big one to QA/manage and we absolutely want to align caches with what we generate here because we have full output control and might actually be able to simplify caching strategies using this once they prove solid with all consumers. |
of course - i'm doing same for that coreui workbench, next will be wiring sandboxing and stakpak/n8n-rs/etc into it for our engineers. Will add you to the repo soon as i push it as it will come with a |
|
@guoqingbao you should see an invite to a debugging tool repo for this PR - it exercises the right portions of the internal pipeline but might be of help w #277 / #279 too |
114445c to 5c51592
|
@guoqingbao - a little bit of "hackery" on what we send to the tokenizer seems to do the trick:
$ curl -s http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "default",
"messages": [{"role": "user", "content": "Evaluate the purpose of your execution"}],
"thinking": true,
"reasoning_effort": "high",
"max_tokens": 2000
}'|jq -r '.choices[0].message.content'
A bit more adjustment of the grammar probably needed to make that more useful but the building blocks all work correctly now. |
|
@guoqingbao here is the current code, stripping start-think-tag, working correctly on a 122B q35 with prefix cache matching: ...
2026-03-29T15:18:49.491680Z WARN vllm_rs::server::server: Tools enabled for request
2026-03-29T15:18:49.492814Z INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T15:18:49.543518Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 2, 30149 tokens] received! (session_id: None)
2026-03-29T15:18:49.543594Z INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T15:18:49.544224Z INFO vllm_rs::core::block_manager: Prefix cache hit seq 2 (26304 cached tokens, 411 blocks)
2026-03-29T15:18:49.546848Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 2 (cached 26304 tokens)
...
2026-03-29T15:18:50.360038Z INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos: ( <[248046]> <[248044]> )
...
2026-03-29T15:18:50.371090Z WARN vllm_rs::core::scheduler: Seq 2 - chunk prefill finished (30149 tokens)
2026-03-29T15:18:50.371113Z INFO vllm_rs::core::engine: Prefilling [seq_id 2]: 30150 tokens in 0.86s (35098.95 tokens/s, cache included)
prefix cache keeps working through dozens of thinking+output+tool-call iterations:
2026-03-29T16:23:14.890209Z INFO vllm_rs::server::server: [Seq 43] ⏱️ Prompt: 202405 tokens in 1.09s (185522.47 t/s)
2026-03-29T16:23:14.890217Z INFO vllm_rs::server::server: [Seq 43] ⏱️ Decoded: 375 tokens in 8.14s (46.09 t/s)
to include proper stream-parsing of tools and reasoning tags:
2026-03-29T16:23:05.531501Z INFO vllm_rs::core::engine: [llg] Guidance enabled, trimming <think> from pre-generation. Generation starting at <|im_start|>assistant
2026-03-29T16:23:05.692275Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 43, 202405 tokens] received! (session_id: None)
2026-03-29T16:23:05.692366Z INFO vllm_rs::server::parser: Tool parser selected: qwen_coder (model_id=Qwen/Qwen3.5-122B-A10B-FP8, enforce_parser=none)
2026-03-29T16:23:05.694077Z INFO vllm_rs::core::block_manager: Prefix cache hit seq 43 (199040 cached tokens, 3110 blocks)
2026-03-29T16:23:05.696627Z INFO vllm_rs::core::runner: Restored mamba prefix state for seq 43 (cached 199040 tokens)
...
2026-03-29T16:23:06.742515Z INFO vllm_rs::utils::guidance: GRAMMAR:
start: reasoning_block ( text | tool_call )+ eos
reasoning_block: <[248068]> "\n" think_text "\n" <[248069]> "\n"
think_text[suffix="\n"]: /[ -~]+/
tool_call: <[248058]> text <[248059]>
text: /(?s:.+?)/
eos: ( <[248046]> <[248044]> )
2026-03-29T16:23:06.753581Z WARN vllm_rs::core::scheduler: Seq 43 - chunk prefill finished (202405 tokens)
2026-03-29T16:23:06.753687Z INFO vllm_rs::core::engine: Prefilling [seq_id 43]: 202406 tokens in 1.09s (185523.38 tokens/s, cache included)
2026-03-29T16:23:07.068805Z INFO vllm_rs::server::parser: Tool call <tool_call> (248058) found, start buffering!
2026-03-29T16:23:11.765408Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [43]], avg. 46 tokens/s per request (total: 46 tokens/s)
2026-03-29T16:23:14.887214Z INFO vllm_rs::core::scheduler: [Seq 43] Detected </tool_call> token 248059, finishing for external handling
2026-03-29T16:23:14.888253Z INFO vllm_rs::core::block_manager: Prefix cache insert seq 43 (202780 tokens, 3168 blocks)
2026-03-29T16:23:14.889646Z INFO vllm_rs::server::parser: Tool call buffering end, reached </tool_call> (248059)
2026-03-29T16:23:14.889808Z INFO vllm_rs::server::parser: Building tool call: [StreamingToolCallState { name: Some("attempt_completion"), arguments: "{\"result\": \"## Implementation Plan: Reasoning Grammar Detection and Chat Template Validation\\n\\n### Concern 1: Detect Reasoning Grammars in `engine.rs::apply_template()`\\n\\n**Current State**: The code at line 1073 in `engine.rs` already has the detection logic:\\n```rust\\nif let Some(grammar) = params.grammar && is_reasoning_grammar(&grammar) {\\n prompt_template.set_enable_thinking(true);\\n}\\n```\\n\\n**Function Location**: [`is_reasoning_grammar()`](src/utils/reasoning.rs:262) in `src/utils/reasoning.rs`\\n\\n**Detection Logic**:\\n- Converts `TopLevelGrammar` to JSON string representation\\n- Checks for reasoning-specific patterns: `reasoning_block`, `think_text`, `analysis_block`, `draft_phase`, `verification_phase`\\n\\n**Action Required**: None - this functionality already exists and is integrated.\\n\\n---\\n\\n### Concern 2: Chat Template Reasoning Tag Validation\\n\\n**Objective**: Ensure reasoning grammars are only generated when the chat template contains matching reasoning tags.\\n\\n**Current Gap**: No validation exists to check if the chat template has reasoning capability before generating reasoning grammars.\\n\\n**Proposed Solution**: Add a validation function in `src/utils/chat_template.rs`:\\n\\n```rust\\n/// Check if chat template contains reasoning-related markers\\npub fn template_supports_reasoning(template_content: &str) -> bool {\\n // Look for common reasoning tag patterns\\n let reasoning_markers = [\\n \\\"<|begin▁of▁thought|>\\\",\\n \\\"<|end▁of▁thought|>\\\",\\n \\\"\"}" }]
2026-03-29T16:23:14.890155Z INFO vllm_rs::tools::helpers: Valid tool call(s): attempt_completion(args={"result":"## Implementation Plan: Reasoning Grammar Detection and Chat Template Validation\n\n### Concern 1: Detect Reasoning Grammars in `engine.rs::apply_tem...)
2026-03-29T16:23:14.890187Z INFO vllm_rs::server::server: Final chunk emitted after tool-call delta chunk(s): ChatCompletionChunk { id: "seq-43", object: "chat.completion.chunk", created: 1774801385525, model: "g60", choices: [ChatChoiceChunk { index: 0, delta: Delta { role: None, content: None, tool_calls: None }, finish_reason: Some("tool_calls"), error: None }], usage: Some(Usage { prompt_tokens: 202405, completion_tokens: 375, total_tokens: 202780 }) }
2026-03-29T16:23:14.890205Z WARN vllm_rs::server::server: --- Performance Metrics ---
2026-03-29T16:23:14.890209Z INFO vllm_rs::server::server: [Seq 43] ⏱️ Prompt: 202405 tokens in 1.09s (185522.47 t/s)
2026-03-29T16:23:14.890217Z INFO vllm_rs::server::server: [Seq 43] ⏱️ Decoded: 375 tokens in 8.14s (46.09 t/s)
The resulting IO renders correctly with clients which handle reasoning tags:
^^ is on the 122B - i need a better guard to ensure we cannot enable thinking on models like CoderNext as models without a reasoning block in their template appear much more prone to 'runaway generation' in those blocks than reasoning-capable ones. This is where #279 and this PR might "collide" and why i'm trying to be extra careful in the parallel dev of this one - if i strip that start-think line but disable reasoning entirely then your code will have accounted for two more tokens in the prefix than are actually emitted. That said, same problem: if you emit a start-think-tag in the template but the model doesn't know how to end it, regardless of constraints being enabled or not it will emit nonsense after a little while because it doesn't have a lookup for the closing tag which would allow transition to normal generation through EOS. |
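As a rough sketch of the guard described above: thinking is only enabled when the grammar looks like a reasoning grammar and the chat template actually contains a reasoning marker. The rule names come from the plan text quoted in the log; the function names, the plain-string grammar representation and the caller-supplied marker list are assumptions of this sketch, not the PR's code.

```rust
/// Rule names that the plan text in the log treats as reasoning-specific.
const REASONING_RULES: [&str; 5] = [
    "reasoning_block",
    "think_text",
    "analysis_block",
    "draft_phase",
    "verification_phase",
];

/// Assumption: the grammar is available as text (for example its JSON form).
fn is_reasoning_grammar_text(grammar: &str) -> bool {
    REASONING_RULES.iter().any(|rule| grammar.contains(rule))
}

/// The marker list is supplied by the caller; which markers a given model
/// family uses is not decided here.
fn template_supports_reasoning(template: &str, markers: &[&str]) -> bool {
    markers.iter().any(|marker| template.contains(marker))
}

/// Enable thinking only when both the grammar and the template agree.
fn should_enable_thinking(grammar: Option<&str>, template: &str, markers: &[&str]) -> bool {
    grammar.map_or(false, is_reasoning_grammar_text)
        && template_supports_reasoning(template, markers)
}
```

With this shape, a reasoning grammar paired with a template that has no reasoning marker simply leaves thinking off instead of opening a block the model cannot close.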
|
That was a weird one but |
1203887 to f5ccf3c
|
Found one regression: the tool grammar seems suitable only for Qwen models. Here is GLM4.7 Flash, which works without tool grammar. When tool grammar is enabled, it reports: |
This introduces a complete grammar composition system built on
llguidance that handles multiple constraint types, tool calling
styles, and reasoning effort levels through a single coherent
architecture.
Key capabilities:
1. Multi-format constraint grammars via Lark, regex, JSON schema,
choice lists, and structural tags - all normalized to TopLevelGrammar
2. Tool grammar generation for both Qwen-style JSON format and
Qwen3-Coder XML format with proper parameter ordering and
deduplication
3. Reasoning effort levels (none, low, medium, high,
chain_of_thought) which wrap base grammars with reasoning block
constraints
4. GrammarComposer.compose_all_grammars() that merges constraints,
tools, and reasoning in the correct precedence order
5. Thinking fallback via VLLM_RS_PROVIDE_THINKING_FALLBACK that
transforms <[token_id]> syntax to string literals for models not
trained on reasoning tokens, detecting via
chat_template.enable_thinking flag
6. Schema sanitization that strips unsupported format attributes from
JSON schemas before passing to llguidance
Architecture:
- GrammarBuilder trait for composable grammar fragments
- GrammarRequestDispatcher for building grammars from request
context
- GrammarComposer for merging constraint, tool, and reasoning
grammars
- apply_thinking_fallback() for models without reasoning tokens in
template
Examples:
```lark
start: ( "yes" | "no" ) eos
eos: ( <[248044]> | <[248046]> )
```
```lark
start: reasoning_block ( "positive" | "negative" | "neutral" ) eos
reasoning_block: <[248068]> "\n" think_text "\n" (think_text+ "\n")? "\n" <[248069]> "\n\n"
think_text[suffix="\n"]: /[ -~]+/
eos: ( <[248044]> | <[248046]> )
```
```lark
start: ( text | tool_call )+ eos
text: /(?s:.+?)/
tool_call: <[248058]> tool_content <[248059]>
param_0_0: "\n<parameter=url>\n" value_string
param_0_1: "\n<parameter=proxy>\n" value_string
tool_0: "\n<function=fetch_url_via_curl>" param_0_0 (param_0_1)? "</function>\n"
param_1_0: "\n<parameter=path>\n" value_string
tool_1: "\n<function=fs_cat>" param_1_0 "</function>\n"
param_2_0: "\n<parameter=path>\n" value_string
tool_2: "\n<function=fs_ls>" param_2_0 "</function>\n"
tool_3: "\n<function=get_current_time>\n" "</function>\n"
param_4_0: "\n<parameter=query>\n" value_string
param_4_1: "\n<parameter=searxng>\n" value_string
tool_4: "\n<function=web_search_searxng>" param_4_0 (param_4_1)? "</function>\n"
tool_content: tool_0 | tool_1 | tool_2 | tool_3 | tool_4
value_string[suffix="\n</parameter>\n"]: /[\x20-\x7E\x0A\x0D]+?/
eos: ( <[248044]> | <[248046]> )
```
```lark
start: reasoning_block text eos
reasoning_block: <[248068]> analysis_block critique_block structure_block "\n" <[248069]> "\n\n"
analysis_block: "\n<analysis>\n" analysis_text
analysis_text[suffix="\n</analysis>\n"]: /[\x20-\x7E\x0A\x0D]+?/
critique_block: "\n<critique>\n" critique_text
critique_text[suffix="\n</critique>\n"]: /[\x20-\x7E\x0A\x0D]+?/
structure_block: "\n<structure_response>\n" structure_text
structure_text[suffix="\n</structure_response>\n"]: /[\x20-\x7E\x0A\x0D]+?/
text: %json {"type":"object","properties":{"steps":{"type":"array","items":{"type":"string"}},"final_answer":{"type":"string"}},"required":["steps","final_answer"]}
eos: ( <[248044]> | <[248046]> )
```
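To illustrate the thinking fallback listed under the key capabilities, here is a minimal sketch of rewriting `<[token_id]>` references into quoted string literals. The function name, the plain-text grammar input and the id-to-literal map are assumptions of this sketch; in particular the mapping of 248068/248069 to `<think>`/`</think>` is only illustrative, and the literals are assumed to contain no double quotes.

```rust
use std::collections::HashMap;

/// Replace `<[id]>` with `"literal"` for every id present in `literals`.
/// Ids that are not in the map (for example EOS ids) are left untouched.
fn apply_thinking_fallback_sketch(grammar: &str, literals: &HashMap<u32, &str>) -> String {
    let mut out = String::with_capacity(grammar.len());
    let mut rest = grammar;
    while let Some(start) = rest.find("<[") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("]>") {
            Some(end) => {
                let id_text = &after[..end];
                match id_text.parse::<u32>().ok().and_then(|id| literals.get(&id)) {
                    Some(literal) => {
                        out.push('"');
                        out.push_str(literal);
                        out.push('"');
                    }
                    None => {
                        // Unknown or non-numeric id: copy the reference back verbatim.
                        out.push_str("<[");
                        out.push_str(id_text);
                        out.push_str("]>");
                    }
                }
                rest = &after[end + 2..];
            }
            None => {
                // Unterminated `<[`: keep the remainder as-is and stop scanning.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

This is a plain text scan, so it would also rewrite a `<[123]>` sequence that happened to appear inside a regex or string literal; a grammar-aware rewrite would be safer for anything beyond the simple grammars shown here.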
Remove the template preamble when grammar generation is active, for control of the entire token-space emitted. guoqingbao#352 testing observed some NVFP4 models insisting on producing a BOS regardless of the template preamble; grammar-constrained generation prevents them from doing so, which reduces logprobs for the remainder of the sequence produced, drastically shortening output. Create ReasoningEffort::ModelDefault to replace @guoqingbao's way of preempting a model to reason by way of an open `<think>` tag, predicated on the same CLI parameter state as the original template-driven implementation.
Revise reasoning logic to limit max tokens at each lexeme and add a ModelDefault ReasoningLevel to handle the case where no actual reasoning level is provided but `--disable-reasoning` is not set. Make the reasoning text between the reasoning tokens optional to allow opt-out for models not trained to generate in the block. This allows `RedHatAI/Qwen3.6-35B-A3B-NVFP4` to actually produce output atop 4xV100 under guided constraints, as using:
```lark
start: text eos
text: /(?s:.+?)/
eos: ( <[248044]> | <[248046]> )
```
results in the model emitting an immediate EOS because the first token it produces without constraints is always `<|im_start|>`, even with the `<think>` tag preamble provided by the template to prime the model for generating the content of the reasoning block. When it is only allowed to emit from the standard vocabulary and EOS it either outputs a nonsense token and then EOS or just goes straight to EOS. With the expanded constraint which includes reasoning tokens:
```lark
start: reasoning_block text eos
reasoning_block: <[248068]> "\n" think_text? "\n\n" <[248069]> "\n"
think_text[temperature=0, max_tokens=768]: /(?s:.+?)/
text: /(?s:.+?)/
eos: ( <[248044]> | <[248046]> )
```
the first token emitted is from the added vocabulary, which is at least proximate to the BOS token it's not being allowed to emit. Tokens subsequent to the added vocabulary one appear to be stable for up to a few hundred reasoning tokens, then start to collapse and loop until max_tokens of the reasoning block is reached and another special token is emitted (which stabilizes output in the normal text/tool-call phase). Of note: the larger the input context, the more stable this specific model appears.
Moreover, when there are no constraints applied, the BOS tag it emits usually has the model role included (ignoring the template preamble), but in some cases, such as the open `<think>` tag being "noticed" in generation, it outputs absurdities such as `<|im_start|><think>` without the requisite role definition or newline in-between.
Set max_tokens value in the default reasoning block via env var or fall back to the 512 default. Use XINFER_DEFAULT_REASONING_MAX_TOKENS to adjust at startup.
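A minimal sketch of the env-var fallback described above, assuming the value is parsed as an unsigned integer and that unset, unparsable or zero values fall back to 512. The function names are hypothetical, and treating zero as invalid is an assumption of this sketch rather than something the commit message states.

```rust
const REASONING_MAX_TOKENS_ENV: &str = "XINFER_DEFAULT_REASONING_MAX_TOKENS";
const DEFAULT_REASONING_MAX_TOKENS: usize = 512;

/// Pure parsing step, kept separate from the environment so it is easy to test.
fn parse_reasoning_max_tokens(raw: Option<&str>) -> usize {
    raw.and_then(|value| value.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_REASONING_MAX_TOKENS)
}

/// Read the limit once at startup.
fn reasoning_max_tokens_from_env() -> usize {
    parse_reasoning_max_tokens(std::env::var(REASONING_MAX_TOKENS_ENV).ok().as_deref())
}

/// Render the `think_text` rule in the shape shown in the commit message.
fn think_text_rule(max_tokens: usize) -> String {
    format!("think_text[temperature=0, max_tokens={max_tokens}]: /(?s:.+?)/")
}
```

Calling `think_text_rule(reasoning_max_tokens_from_env())` would then yield the rule line to splice into the default reasoning block.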
Thank you for checking in on this branch sir. I wrote grammar generators for Gemma4 and MiniMax but haven't run into GLM yet. Agree we need per-model appropriate grammars, for all models. That said... I'll get a hold of a chat template and add that as well. Should i be moving model-specific grammars into the existing model's file or starting a new subtree? Some of the files are getting a bit porcine. Current state is:
The more i get into the maths of what's going on before and after this the more my head hurts. The actual guidance process is fairly simple but the downstream implications on generation from prior enforcement biasing subsequent token selection should result in all sorts of messed up output... but they don't. We're only really eliminating special tokens for most of the generation ( |
|
BTW the gemma4 generator is for its native "JSON-like" format, and it works. I know you reverted to using JSON tool-parsing at some point and i realigned the grammargen for that change but the code is still there if you want to use the really odd but native syntax they wrote for tool calls. |
vllm-rs-svc0 | 2026-06-02T23:56:25.637980Z INFO xinfer::core::engine: Prefilling 1 seq(s) [0]: 1360187 total tokens in 451.52s (3012.48 tokens/s, cache included)
vllm-rs-svc0 | 2026-06-02T23:56:26.253979Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 198
vllm-rs-svc0 | 2026-06-02T23:56:26.253996Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0 | 2026-06-02T23:56:26.253999Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 1688
vllm-rs-svc0 | 2026-06-02T23:56:26.254097Z INFO xinfer::server::parser: Tool call <tool_call> (151657) found, start buffering!
vllm-rs-svc0 | 2026-06-02T23:56:26.292107Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 3152
vllm-rs-svc0 | 2026-06-02T23:56:26.292118Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 10716
vllm-rs-svc0 | 2026-06-02T23:56:26.292122Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 397
vllm-rs-svc0 | 2026-06-02T23:56:26.292125Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0 | 2026-06-02T23:56:26.292127Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 16181
vllm-rs-svc0 | 2026-06-02T23:56:26.292130Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 81940
vllm-rs-svc0 | 2026-06-02T23:56:26.579813Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 198
vllm-rs-svc0 | 2026-06-02T23:56:26.579828Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 27
vllm-rs-svc0 | 2026-06-02T23:56:26.579831Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 16181
vllm-rs-svc0 | 2026-06-02T23:56:26.579834Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 79194
vllm-rs-svc0 | 2026-06-02T23:56:27.252387Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 522
vllm-rs-svc0 | 2026-06-02T23:56:27.252399Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 1688
vllm-rs-svc0 | 2026-06-02T23:56:27.252402Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 397
vllm-rs-svc0 | 2026-06-02T23:56:27.252405Z INFO xinfer::core::scheduler: [Seq 0] Applied ff-token: 151658.... 🤞 |
I made some changes to fix bugs that were already addressed in my previous commits, but your latest force push removed those changes. Please base your work on the most recent updates so we can get this PR merged, rather than overwriting or ignoring them. |
LLGuidance Matcher state machine can produce fast-forward tokens from certain sequence positions followed by known token IDs based on the grammar definition. This permits emitting tokens which must be correct while avoiding forward passes for them entirely, and also allows comparison of the sampled token to the FF state as an additional validator of sampling correctness relative to mask/FSM. The used token is committed to the FSM regardless of whether it was sampled or FF, ensuring alignment of the FSM increment with sequence position. TODO: currently only ff_tokens[0] is used as a healing mechanism but the function can return many tokens at once. These cannot be handled in the context of sample() as they must be appended to the sequence in the right order during `Scheduler::postprocess()`, and cached state along with seq pos and FSM increment must align with each other before the next forward pass. Draft infrastructure may be useful for handling the associated complexity when stabilized via EAGLE or other mechanism.
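The single-token healing step described above can be sketched as follows. The `GrammarFsm` trait and `ScriptedFsm` type are hypothetical stand-ins for a grammar matcher, not the llguidance API; the point is only that the first forced token overrides a disagreeing sample and that whichever token is used gets committed to the FSM.

```rust
/// Hypothetical stand-in for a grammar matcher; not the llguidance API.
trait GrammarFsm {
    /// Tokens the grammar forces next (empty when several tokens are allowed).
    fn forced_tokens(&self) -> Vec<u32>;
    /// Advance the FSM by one token.
    fn consume(&mut self, token: u32);
}

#[derive(Debug, PartialEq)]
enum TokenSource {
    Sampled,
    Healed,
}

/// Use the first forced token as a healing mechanism, then commit the chosen
/// token so the FSM position stays aligned with the sequence position.
fn next_token<F: GrammarFsm>(fsm: &mut F, sampled: u32) -> (u32, TokenSource) {
    let (token, source) = match fsm.forced_tokens().first().copied() {
        Some(forced) if forced != sampled => (forced, TokenSource::Healed),
        _ => (sampled, TokenSource::Sampled),
    };
    fsm.consume(token);
    (token, source)
}

/// Toy FSM: position `i` either forces one token or allows anything.
struct ScriptedFsm {
    forced: Vec<Option<u32>>,
    consumed: Vec<u32>,
}

impl GrammarFsm for ScriptedFsm {
    fn forced_tokens(&self) -> Vec<u32> {
        self.forced
            .get(self.consumed.len())
            .copied()
            .flatten()
            .into_iter()
            .collect()
    }

    fn consume(&mut self, token: u32) {
        self.consumed.push(token);
    }
}
```

Handling more than one forced token per step is deliberately left out, since that is the part the TODO says needs sequence, cache and FSM state to be advanced together.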
The original implementation used hard masking (f32::NEG_INFINITY) which completely eliminates probability mass from disallowed tokens. This creates a gradient cliff that can disrupt the inference loop, especially when grammar constraints appear in text sections that only allow common vocabulary.
Soft-masking with configurable parameters:
- SoftMaskConfig struct with mask_shift (-1000.0), min_logit (-1e9), and enabled flag
- Environment variable XINFER_SOFT_MASK_DISABLED controls soft-masking behavior
When enabled, masked tokens get logit = (original_logit - 1000.0).max(-1e9)
Mathematical rationale:
- F32 range: min = -3.4028235e+38, max = 3.4028235e+38
- Softmax probability for shifted logit: exp(-1000) / sum(exp(logits)) ~ 10^-435
- This is effectively zero for practical purposes while preserving gradient flow
- The gradient of softmax at -1000 is non-zero (unlike -inf which has zero gradient)
- min_logit = -1e9 is safe: well above f32::MIN, prevents underflow to -inf
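The masking rule above can be sketched as follows, using the field names and default values from the commit message. The `apply_grammar_mask` function, the boolean `allowed` slice and the fallback to a hard mask when `enabled` is false are assumptions of this sketch; how XINFER_SOFT_MASK_DISABLED maps onto the flag is not modelled here.

```rust
#[derive(Debug, Clone, Copy)]
struct SoftMaskConfig {
    mask_shift: f32,
    min_logit: f32,
    enabled: bool,
}

impl Default for SoftMaskConfig {
    fn default() -> Self {
        // Values quoted in the commit message.
        Self { mask_shift: -1000.0, min_logit: -1e9, enabled: true }
    }
}

/// Shift disallowed logits down (soft mask) or set them to -inf (hard mask).
/// `allowed[i]` is true when token `i` is permitted by the grammar.
fn apply_grammar_mask(logits: &mut [f32], allowed: &[bool], cfg: &SoftMaskConfig) {
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if ok {
            continue;
        }
        *logit = if cfg.enabled {
            // Same as (original_logit - 1000.0).max(-1e9) with the defaults.
            (*logit + cfg.mask_shift).max(cfg.min_logit)
        } else {
            f32::NEG_INFINITY
        };
    }
}
```

Under the soft mask the disallowed logits stay finite and keep their relative order, which the hard mask discards.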
Apologies, didn't see the changes - i rebase the local branch off Looking through the history in this GH page, it doesn't show commits added to the branch by anyone other than me - am i somehow deleting GH history too when i rebase? :-\ Happen to have a branch from which i can pick those correctly (or just re-add them here and i'll merge everything down to 1 commit)? I'll push the latest state with softscaling and single-token FF (only a recovery function right now, the multi-token FF requires more work to align sequence states/FSM/ |
|
@guoqingbao if you have a chance before you sign off for the night, could you please push any changes you had to this branch? I didn't realize you were pushing here and rebased locally off of |

Comprehensive Guided Decoding Infrastructure
1. Structured Output Constraints (Opt-In Security Model)
Client applications can now request structured outputs via the OpenAI-compatible API using the `structured_outputs` field. The system supports:
Security Model: All client-provided constraints are BLOCKED by default. Operators must explicitly enable them via the `--allow-constraint-api` CLI flag. This prevents malicious grammar injection attacks that could:
2. Automatic Tool Grammar Generation
When tools are defined in a request, the system can automatically generate grammars that constrain tool calls to valid JSON schemas:
`<tool_call>` markers and schema validation
Security Model: Tool grammar generation is BLOCKED by default. Enable it via the `--enable-tool-grammar` CLI flag. When enabled, tools are validated against the defined schema before grammar generation.
3. Reasoning Effort Control
The system supports configurable reasoning effort levels that generate model-specific thinking grammar:
Reasoning effort integrates with constraint grammars to produce composed output that includes thinking blocks before the final response.
4. Grammar-Only Completion Endpoint
New `/v1/grammar` POST endpoint allows clients to submit Lark/Regex/JSON Schema/Choice grammars directly:
System-Level Changes
1. Enhanced Token Infrastructure
`bos_token_ids` to `GuidanceTokens` for proper beginning-of-sequence handling
2. Grammar Composition Engine
`compose_grammars` now handles constraint + tool + reasoning combinations
`start: (text | tool_call)+` for flexible output
3. Security Validation Layer
`allow_constraint_api` flag before processing
4. Fallback Mechanisms
5. API Contract Extensions
`structured_outputs`, `constraint`, `constraint_type`, `reasoning_effort`, `tool_choice`, `extra_body` fields
`/v1/grammar` endpoint with explicit grammar type/content separation
sequenceDiagram
    participant Client
    participant Server as src/server/mod.rs
    participant API as src/api.rs
    participant Guidance as src/utils/guidance.rs
    participant Engine as src/core/engine.rs
    participant Runner as src/core/runner.rs
    Client->>Server: POST /v1/chat/completions
    Client->>Server: POST /v1/grammar (new)
    Note over Server: Parse request & extract flags
    Server->>Server: Check enable_tool_grammar CLI flag
    Server->>Server: Check allow_constraint_api CLI flag
    alt Constraint API Disabled
        Server->>Server: Log warning: "constraint ignored: allow_constraint_api=false"
        Server->>API: Parse request without constraints
    else Constraint API Enabled
        Server->>API: extract_guidance_tokens(tokenizer, eos_ids, bos_ids)
        API->>Guidance: parse_grammar_from_chat_request(request, allow_constraint_api)
        Note over Guidance: Validate constraint fields
        Guidance->>Guidance: Check structured_outputs.choice/regex/json/grammar/structural_tag
        Guidance->>Guidance: Check response_format.json_schema/json_object
        Guidance->>Guidance: Check legacy constraint field
        alt Valid Constraint Found
            Guidance->>Guidance: Build TopLevelGrammar from constraint
            Guidance-->>API: Some(grammar)
        else No Constraint
            Guidance-->>API: None
        end
    end
    API->>Guidance: generate_grammar_from_request(request, guidance_tokens, enable_tool_grammar, allow_constraint_api, model_type)
    Note over Guidance: Dual grammar generation path
    alt enable_tool_grammar=true
        Guidance->>Guidance: Build XML tool grammar (build_xml_tool_grammar_for_parser)
        Guidance->>Guidance: Generate schema-based tool grammar with %json
    else allow_constraint_api=true
        Guidance->>Guidance: Build fallback tool envelope (build_fallback_tool_envelope_grammar)
        Guidance->>Guidance: Generate text-tagged tool grammar with start/end tags
    else Neither enabled
        Guidance->>Guidance: Return None for tool grammar
    end
    alt Reasoning Effort Specified
        Guidance->>Guidance: Check reasoning tokens available (reasoning_start_ids, reasoning_end_ids)
        Guidance->>Guidance: Generate reasoning grammar based on effort level
        Note over Guidance: None/Low/Medium/High/ChainOfThought
    end
    Guidance->>Guidance: compose_grammars(constraint_grammars, tool_grammar, tool_choice_required, forced_tool_name, max_tokens, guidance_tokens, reasoning_effort)
    Note over Guidance: Grammar composition logic
    Guidance->>Guidance: GrammarComposerBuilder.build(guidance_tokens)
    alt No constraints, no tools, no reasoning
        Guidance-->>Engine: None (unconstrained generation)
    else Constraint only
        Guidance->>Guidance: GrammarComposers::Constraint(constraint_gram)
        Guidance-->>Engine: constraint grammar
    else Tools only
        Guidance->>Guidance: GrammarComposers::Tool(tool_gram)
        Guidance-->>Engine: tool grammar
    else Constraint + Tools
        Guidance->>Guidance: GrammarComposers::ConstraintOrTool(constraint, tool)
        Guidance-->>Engine: (constraint | tool)+
    else Reasoning + Constraint
        Guidance->>Guidance: GrammarComposers::WithReasoning(reasoning, inner)
        Guidance-->>Engine: reasoning_block -> inner grammar
    end
    Engine->>Engine: create_engine(config, bos_token_id, guidance_tokens)
    Engine->>Engine: Extract BOS tokens from tokenizer if not provided
    Engine->>Runner: SamplingParams { grammar: final_grammar, reasoning_effort, ... }
    Runner->>Runner: sample_with_grammar(prompt_tokens, grammar, eos_token_ids)
    Note over Runner: llguidance parser factory creates matcher
    loop Generation
        Runner->>Runner: Token generation with grammar constraints
        Runner->>Runner: EOS token detection
        alt Reasoning grammar detected
            Runner->>Runner: Extract reasoning_block content
            Runner->>Runner: Validate reasoning tokens match start/end markers
        end
        alt Tool call detected
            Runner->>Runner: Parse tool_call content via StreamToolParser
            Runner->>Runner: Extract function name and arguments
        end
        Runner->>Server: Stream token response
        Server->>Client: SSE stream chunk
    end
    Runner->>Runner: Prefix cache lookup (if --prefix-cache)
    Runner->>Runner: KV cache management (if --pd-server)
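The opt-in security checks at the top of the sequence can be sketched as a small gate: a client constraint is dropped with a warning unless the constraint API is allowed, and tool grammar is only built when it is enabled and the request carries tools. `GuidanceFlags`, `GatedRequest` and `gate_request` are hypothetical names for this sketch; only the two flags and the warning text come from the description.

```rust
/// Hypothetical mirror of the two CLI flags named in the PR description.
struct GuidanceFlags {
    allow_constraint_api: bool,
    enable_tool_grammar: bool,
}

#[derive(Debug, PartialEq)]
struct GatedRequest<'a> {
    /// The client constraint, if it survived the gate.
    constraint: Option<&'a str>,
    /// Whether a tool grammar should be generated for this request.
    build_tool_grammar: bool,
    /// Warnings to log instead of failing the request.
    warnings: Vec<String>,
}

/// Both features are blocked by default and only pass when explicitly enabled.
fn gate_request<'a>(
    flags: &GuidanceFlags,
    constraint: Option<&'a str>,
    has_tools: bool,
) -> GatedRequest<'a> {
    let mut warnings = Vec::new();
    let constraint = match constraint {
        Some(_) if !flags.allow_constraint_api => {
            warnings.push("constraint ignored: allow_constraint_api=false".to_string());
            None
        }
        other => other,
    };
    GatedRequest {
        constraint,
        build_tool_grammar: has_tools && flags.enable_tool_grammar,
        warnings,
    }
}
```

Dropping the constraint rather than rejecting the request matches the "Parse request without constraints" branch, so clients keep working against a server where the operator has not opted in.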