Support guided decoding by guoqingbao · Pull Request #263 · guoqingbao/xinfer

guoqingbao · 2026-03-13T02:26:48Z

A refactored version based on #262

This implements the full llguidance integration, enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure

Key components:
- params.grammar field in SamplingParams for RPC serialization
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via build_tool_call_lark_grammar()

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided structured_outputs/response_format
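The two correction paths named above (inline logits masking, and rollback bounded by a retry cap) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `apply_token_mask`, `argmax`, and `RollbackCounter` are hypothetical names, and the boolean `allowed` slice stands in for whatever mask the llguidance Matcher produces at each step. Only the constant `MAX_ROLLBACK_ATTEMPTS = 3` is taken from the description.

```rust
/// Cap taken from the PR description (MAX_ROLLBACK_ATTEMPTS=3).
const MAX_ROLLBACK_ATTEMPTS: u32 = 3;

/// Inline correction: push every grammar-disallowed token to -inf so the
/// sampler can only choose grammar-valid tokens.
/// `allowed[i] == true` means token `i` is permitted at this step.
/// The length check plays the role of a vocab size validation.
pub fn apply_token_mask(logits: &mut [f32], allowed: &[bool]) {
    assert_eq!(logits.len(), allowed.len(), "mask must cover the whole vocab");
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick over masked logits; returns None if every token is masked.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| l.is_finite())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Hypothetical per-sequence bookkeeping for post-process correction.
#[derive(Default)]
pub struct RollbackCounter {
    attempts: u32,
}

impl RollbackCounter {
    /// Record one rollback; returns false once the cap is exhausted,
    /// which is what prevents an infinite rollback loop.
    pub fn try_rollback(&mut self) -> bool {
        if self.attempts >= MAX_ROLLBACK_ATTEMPTS {
            return false;
        }
        self.attempts += 1;
        true
    }

    /// Clear the counter, e.g. when the sequence state is reset.
    pub fn reset(&mut self) {
        self.attempts = 0;
    }
}
```

In this sketch a caller would mask the logits before sampling on every step, and consult `try_rollback()` only when post-hoc validation rejects the generated text.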

Expand SpecialTokens usage to cover EOS handling across the codebase, including the chat template. This gates access to the EOS tokens through a single common API, providing an interdiction point to add or remove them per model or model family as required.

- Replace manual EOS token extraction logic with centralized SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token IDs
- Remove stop_token_ids from SamplingParams and related logic (now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with Option<String>
- Add comprehensive SpecialTokens API with category-based accessors, ID/string sets, and search methods
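To make the idea of a single access point with category-based accessors concrete, here is a rough sketch. Everything in it is an assumption for illustration: the type is deliberately named `SpecialTokensSketch` to avoid implying it matches the real `SpecialTokens`, and the `Category` variants, field names, and method names are hypothetical. Only the choice of a plain `Vec<u32>` for EOS IDs follows the description above.

```rust
use std::collections::BTreeSet;

/// Hypothetical token categories; the real set is not shown in the PR text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Category {
    Eos,
    ToolStart,
    ToolEnd,
    Pad,
}

/// One special token: its vocabulary ID, its string form, and its category.
#[derive(Clone, Debug)]
pub struct Token {
    pub id: u32,
    pub text: String,
    pub category: Category,
}

/// Illustrative stand-in for a centralized special-token registry.
pub struct SpecialTokensSketch {
    tokens: Vec<Token>,
}

impl SpecialTokensSketch {
    pub fn new(tokens: Vec<Token>) -> Self {
        Self { tokens }
    }

    /// Category-based accessor: sorted, de-duplicated IDs for one category.
    pub fn ids(&self, category: Category) -> Vec<u32> {
        let set: BTreeSet<u32> = self
            .tokens
            .iter()
            .filter(|t| t.category == category)
            .map(|t| t.id)
            .collect();
        set.into_iter().collect()
    }

    /// EOS IDs as a plain Vec<u32>, with no dedicated enum in between.
    pub fn eos_ids(&self) -> Vec<u32> {
        self.ids(Category::Eos)
    }

    /// Search by exact string form.
    pub fn find(&self, text: &str) -> Option<&Token> {
        self.tokens.iter().find(|t| t.text == text)
    }
}
```

Routing every EOS lookup through one accessor like `eos_ids()` is what would give callers a single place to add or drop tokens per model family.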

Improve the binary example to be a handy extractor for models which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B

Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE

This PR introduces a new `reasoning_effort` parameter to control reasoning block generation in the chat completion API, matching OpenAI's reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to `ChatCompletionRequest` accepting "none", "low", "medium", or "high" values (case-insensitive)
- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar construction
  - `thinking_grammar_with_reasoning_block()` generating Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks with base grammars
- **Integration**: Updated `compose_grammars()` in `src/utils/guidance.rs` to accept and apply reasoning effort levels
- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()` in `src/tools/schema.rs` to strip null from required field types, ensuring grammars enforce field presence for tool parameters
- **Special Token Helpers**: Added `reasoning_start_ids()`, `reasoning_end_ids()`, and `reasoning_tokens()` methods to `SpecialTokens` for robust token detection
- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning
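The case-insensitive parsing of `reasoning_effort` could look roughly like the sketch below. The enum name and the four accepted values come from the description. The variant names, the `String` error type, and the error message are assumptions, and the actual code in `src/utils/reasoning.rs` may differ.

```rust
use std::str::FromStr;

/// The four effort levels listed in the PR description.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    // Assumption: a plain String error; the real error type is not shown.
    type Err = String;

    /// Case-insensitive match on "none" | "low" | "medium" | "high".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("invalid reasoning_effort: {other:?}")),
        }
    }
}
```

With `FromStr` in place, a request handler can write `"High".parse::<ReasoningEffort>()` and reject anything outside the four accepted values.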

This change updates the ToolGrammarBuilder to correctly use pad token IDs for XML tool call termination when building Lark grammars for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for `</function>` and `</parameter>` tags. When the tokenizer lacks special closing tags, the model can run on generating forever, as XML is not a finite stateless grammar; see guidance-ai/llguidance/issues/306. Use pad tokens as "magic" terminating markers embedded into the grammar and recognizable by the tokenizer/llg mask as output the model cannot normally emit in its textual output (normally masked to 0.0 logprob). Anchor the XML function/parameter generation like we bound tool-call and text. This is easier on the model than forcing JSON parsing (qwen3), especially in conjunction with forcing it to `<think>` if it's not trained to do so.

Mechanically, we modify the chat template to add special pad tags after the closing tags for function and param, and inject those into the grammar template submitted to the model as tool-choice.

Call path:
- src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
  - Uses pad_ids[0] as `</function>` anchor
  - Uses pad_ids[1] as `</parameter>` anchor

Grammar structure:
- `start: ( text | tool_call )+ eos?`
- `tool_call: <[tool_start_id]> tool_content <[tool_end_id]>`
- `tool_0: "<function=fetch_url_via_curl>" param_0_0 ... "</function>" <[pad_id_0]>`
- `param_0_0: "<parameter=url>" value_0_0 ... "</parameter>" <[pad_id_1]>`
- ...

The pad tokens serve as finite termination points for the XML parser, allowing the Lark grammar to generate valid, parseable tool calls without requiring explicit special closing tags in the tokenizer vocabulary.
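As a rough sketch of how the anchored rules above might be emitted, the function below builds the rule text from a tool list and the token IDs. It is not the PR's `build_xml_with_anchors`: `ToolSpec` and `build_xml_rules` are hypothetical names, and the `tool_content` alternation over `tool_i` is an assumption inferred from the rule names. The `text`, `eos`, and `value_i_j` rules are not given in the description, so they are left undefined here and the output is not a complete grammar.

```rust
use std::fmt::Write;

/// Hypothetical minimal tool description: a name plus parameter names.
pub struct ToolSpec<'a> {
    pub name: &'a str,
    pub params: &'a [&'a str],
}

/// Emit the anchored XML tool-call rules in the shape shown above.
/// pad_ids[0] closes a function, pad_ids[1] closes a parameter.
pub fn build_xml_rules(
    tools: &[ToolSpec<'_>],
    tool_start_id: u32,
    tool_end_id: u32,
    pad_ids: [u32; 2],
) -> String {
    let mut out = String::new();
    out.push_str("start: ( text | tool_call )+ eos?\n");
    writeln!(out, "tool_call: <[{tool_start_id}]> tool_content <[{tool_end_id}]>").unwrap();

    // Assumption: tool_content is an alternation over the per-tool rules.
    let alts: Vec<String> = (0..tools.len()).map(|i| format!("tool_{i}")).collect();
    writeln!(out, "tool_content: {}", alts.join(" | ")).unwrap();

    for (i, tool) in tools.iter().enumerate() {
        let params: Vec<String> = (0..tool.params.len())
            .map(|j| format!("param_{i}_{j}"))
            .collect();
        // The pad token after </function> is the finite termination anchor.
        writeln!(
            out,
            "tool_{i}: \"<function={}>\" {} \"</function>\" <[{}]>",
            tool.name,
            params.join(" "),
            pad_ids[0]
        )
        .unwrap();
        for (j, param) in tool.params.iter().enumerate() {
            // Same idea for parameters, using the second pad token.
            writeln!(
                out,
                "param_{i}_{j}: \"<parameter={param}>\" value_{i}_{j} \"</parameter>\" <[{}]>",
                pad_ids[1]
            )
            .unwrap();
        }
    }
    out
}
```

Because each closing tag is followed by a token ID that ordinary text generation does not produce, the grammar has an unambiguous point at which a function or parameter body must end.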

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b02393b8e

sempervictus · 2026-03-14T13:43:40Z

Thank you, I held back the tools commit to just get 263 in so I'll need to rewrite a bit of that but the think grammar run-on i fixed in that branch was an expression issue not fitting the template. I really like the @ anchor thing but can't seem to get rust llguidance to resolve those markers. The flat composition pattern works great though.

Absorbing some caffeine and taking a closer look.

guoqingbao · 2026-03-14T14:50:44Z

> Thank you, I held back the tools commit to just get 263 in so I'll need to rewrite a bit of that but the think grammar run-on i fixed in that branch was an expression issue not fitting the template. I really like the @ anchor thing but can't seem to get rust llguidance to resolve those markers. The flat composition pattern works great though.
>
> Absorbing some caffeine and taking a closer look.

I think guided decoding is stable now and has no side effects on conventional tool calling.

guoqingbao · 2026-03-16T02:25:40Z

Finished on #262

RageLtMan and others added 17 commits March 9, 2026 10:12

Support Qwen3.5 Dense models on Metal (#258)

9a1ab79

More SpecialTokens, Improve Example/Binary

2956d5c

Improve the binary example to be a handy extractor for models which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B

SpecialTokens Strings for Llama4 and Qwen3.5 MoE

0eda75b

Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE

ToolConfig Population w/ SpecialTokens

dbb961d

Drop ToolFormat

621b3a6

Tier Reasoning Effort

f71d7fe

Merge remote-tracking branch 'origin/main' into reasoning/pr

b3f72f2

Cargo fmt

fd53bc6

Refactor guided decoding

786fd24

Update docs

0612ddf

Typo fix

6c80f26

Remove tool grammar & fix slow first token response for sync request

2b02393

chatgpt-codex-connector Bot reviewed Mar 13, 2026

Comment thread src/server/mod.rs Outdated

Comment thread src/utils/guidance.rs Outdated

Comment thread src/utils/reasoning.rs

guoqingbao mentioned this pull request Mar 13, 2026

Enable Reasoning via Guided Enforcement #262

Merged

guoqingbao added 4 commits March 14, 2026 08:21

Strip guided-decoding’s leftover tool grammar surface

7f70675

Fix incorrect guidance application

0157c41

Revert changes for scheduler.rs (tool call related)

89341b3

Apply per-sequence guided decoding

ba46d78

Remove redundancy

4997457

guoqingbao added 2 commits March 14, 2026 15:17

Fix corner case

60e88e0

Permit empty tool call result

6d8daae

guoqingbao closed this Mar 16, 2026

guoqingbao deleted the reasoning/pr branch April 18, 2026 06:34

guoqingbao wants to merge 24 commits into main from reasoning/pr
