
Support guided decoding#263

Closed
guoqingbao wants to merge 24 commits into main from reasoning/pr

@guoqingbao
Owner

@guoqingbao guoqingbao commented Mar 13, 2026

A refactored version based on #262

RageLtMan and others added 17 commits March 9, 2026 10:12
This implements the full llguidance integration enabling
grammar-constrained inference for structured outputs, tool calling,
and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState
 → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure
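For readers unfamiliar with inline correction, here is a minimal sketch of logits masking in general. It is not the code from this PR: `apply_token_mask`, `argmax`, and the boolean `allowed` slice are hypothetical names, and greedy selection stands in for whatever sampler the runner actually uses.

```rust
/// Push every token the grammar disallows to negative infinity so that
/// sampling can never pick it. `allowed[i]` is a hypothetical per-token
/// mask; a real matcher would produce it for the current grammar state.
fn apply_token_mask(logits: &mut [f32], allowed: &[bool]) {
    assert_eq!(logits.len(), allowed.len(), "mask must cover the whole vocabulary");
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick over the masked logits. Returns None when no token is allowed.
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| l.is_finite())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Masking before sampling means an invalid token is never emitted, so the rollback path is only needed for failures caught later in validation.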

Key components:
- params.grammar field in SamplingParams for RPC serialization
GuidanceState
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite
loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via
build_tool_call_lark_grammar()
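The rollback limit can be pictured with a small counter like the one below. This is a sketch under assumptions: only the constant name and its value of 3 come from the commit message, while `RollbackBudget` and its methods are hypothetical and say nothing about how GuidanceState stores the count.

```rust
/// Assumed to mirror the MAX_ROLLBACK_ATTEMPTS=3 limit named in the commit.
const MAX_ROLLBACK_ATTEMPTS: u32 = 3;

/// Hypothetical stand-in for the rollback bookkeeping.
#[derive(Debug, Default)]
struct RollbackBudget {
    attempts: u32,
}

impl RollbackBudget {
    /// Record one rollback. Returns false once the budget is spent, so the
    /// caller can stop retrying instead of looping forever.
    fn try_rollback(&mut self) -> bool {
        if self.attempts >= MAX_ROLLBACK_ATTEMPTS {
            return false;
        }
        self.attempts += 1;
        true
    }

    /// Clear the counter, for example when a new request starts.
    fn reset(&mut self) {
        self.attempts = 0;
    }
}
```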

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided
structured_outputs/response_format

Expand SpecialTokens usage to cover EOS uses across the codebase,
including the chat template. This gates access to the EOS tokens
through a single common API, providing an interdiction point to add
or remove them as required per model or family.
- Replace manual EOS token extraction logic with centralized
SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic
in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token
IDs
- Remove stop_token_ids from SamplingParams and related logic
(now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with
Option<String>
- Add comprehensive SpecialTokens API with category-based
accessors, ID/string sets, and search methods

Improve the binary example into a handy extractor that developers
can run against a model to update special_tokens.rs quickly.

Add tags extracted from Qwen3.5 0.8B

Narrow the Common category search to find only string duplicates
of actual special tokens (handles "aftermarket" models/merges).

Add and test Llama4 and Qwen3.5 MoE

This PR introduces a new `reasoning_effort` parameter to control
reasoning block generation in the chat completion API, matching OpenAI's
reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to
`ChatCompletionRequest` accepting "none", "low", "medium", or "high"
values (case-insensitive)

- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar
construction
  - `thinking_grammar_with_reasoning_block()` generating
Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks
with base grammars

- **Integration**: Updated `compose_grammars()` in
`src/utils/guidance.rs` to accept and apply reasoning effort levels

- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()`
in `src/tools/schema.rs` to strip null from required field types,
ensuring grammars enforce field presence for tool parameters

- **Special Token Helpers**: Added `reasoning_start_ids()`,
`reasoning_end_ids()`, and `reasoning_tokens()` methods to
`SpecialTokens` for robust token detection

- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning
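As an illustration of the case-insensitive parsing described above, a `ReasoningEffort` enum could implement `FromStr` roughly like this. The variant names and the `String` error type are assumptions; the actual definition in `src/utils/reasoning.rs` may differ.

```rust
use std::str::FromStr;

/// Sketch of the four effort levels accepted by `reasoning_effort`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    // A plain String error keeps the sketch dependency-free.
    type Err = String;

    /// Lowercase the input first so "High", "HIGH", and "high" all match.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}
```

With this in place, `"HIGH".parse::<ReasoningEffort>()` yields `Ok(ReasoningEffort::High)`, and any other string is rejected with an error rather than silently defaulting.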

This change updates the ToolGrammarBuilder to correctly use pad
token IDs for XML tool call termination when building Lark grammars
for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for </function> and
</parameter> tags. When the tokenizer lacks special closing
tags, the model can run on generating forever, because XML is not a
finite stateless grammar; see guidance-ai/llguidance/issues/306.

Use pad tokens as "magic" terminating markers embedded in the
grammar and recognizable by the tokenizer/llg mask as output that
a model cannot normally emit in its text (normally masked to 0.0
logprob). Anchor the XML function/parameter generation
the same way we bound tool-call and text. This is easier on the model
than forcing JSON parsing (qwen3), especially in conjunction with
forcing it to `<think>` if it's not trained to do so.

Mechanically, we modify the chat template to add special pad tags
after the closing tags for function and param, and inject those
into the grammar template submitted to the model as tool-choice.

Call path:
  src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
    ├─ Uses pad_ids[0] as </function> anchor
    └─ Uses pad_ids[1] as </parameter> anchor

Grammar structure:
-  start: ( text | tool_call )+ eos?
-  tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
-  tool_0: "<function=fetch_url_via_curl>" param_0_0 ...
"</function>" <[pad_id_0]>
-  param_0_0: "<parameter=url>" value_0_0 ... "</parameter>"
<[pad_id_1]>
- ...

The pad tokens serve as finite termination points for the XML
parser, allowing the Lark grammar to generate valid, parseable tool
calls without requiring explicit special closing tags in the
tokenizer vocabulary.
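To make the grammar structure above concrete, the sketch below renders the `tool_N` and `param_N_M` rules as Lark text, with a pad-token anchor after each closing tag. `ToolSpec` and `xml_tool_rules` are hypothetical names rather than the `build_xml_with_anchors` implementation, and the `value_N_M` rules are left undefined because the commit message does not spell them out.

```rust
/// Hypothetical description of one tool: its name and parameter names.
struct ToolSpec<'a> {
    name: &'a str,
    params: &'a [&'a str],
}

/// Render Lark rules for one XML-style tool call. Each closing tag is
/// followed by a pad-token anchor in the `<[id]>` form shown above:
/// pad_ids[0] after </function>, pad_ids[1] after </parameter>.
fn xml_tool_rules(index: usize, tool: &ToolSpec, pad_ids: [u32; 2]) -> String {
    let mut out = String::new();

    // The tool rule references one param rule per parameter, in order.
    let param_refs: Vec<String> = (0..tool.params.len())
        .map(|p| format!("param_{index}_{p}"))
        .collect();
    out.push_str(&format!(
        "tool_{index}: \"<function={}>\" {} \"</function>\" <[{}]>\n",
        tool.name,
        param_refs.join(" "),
        pad_ids[0]
    ));

    // One rule per parameter; value_{index}_{p} must be defined elsewhere.
    for (p, name) in tool.params.iter().enumerate() {
        out.push_str(&format!(
            "param_{index}_{p}: \"<parameter={name}>\" value_{index}_{p} \"</parameter>\" <[{}]>\n",
            pad_ids[1]
        ));
    }
    out
}
```

Called with a `fetch_url_via_curl` tool that has a single `url` parameter, this produces one `tool_0` rule and one `param_0_0` rule in the shape listed under "Grammar structure".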

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b02393b8e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/server/mod.rs Outdated
Comment thread src/utils/guidance.rs Outdated
Comment thread src/utils/reasoning.rs
@sempervictus
Contributor

Thank you, I held back the tools commit to just get 263 in so I'll need to rewrite a bit of that but the think grammar run-on i fixed in that branch was an expression issue not fitting the template. I really like the @ anchor thing but can't seem to get rust llguidance to resolve those markers. The flat composition pattern works great though.

Absorbing some caffeine and taking a closer look.

@guoqingbao
Owner Author

> Thank you, I held back the tools commit to just get 263 in so I'll need to rewrite a bit of that but the think grammar run-on i fixed in that branch was an expression issue not fitting the template. I really like the @ anchor thing but can't seem to get rust llguidance to resolve those markers. The flat composition pattern works great though.
>
> Absorbing some caffeine and taking a closer look.

I think guided decoding is stable now and has no side effects on conventional tool calling.

@guoqingbao
Owner Author

Finished on #262

@guoqingbao guoqingbao closed this Mar 16, 2026
@guoqingbao guoqingbao deleted the reasoning/pr branch April 18, 2026 06:34