Support Qwen3.5 Dense models on Metal #258

Merged
guoqingbao merged 1 commit into main from qwen3_5_metal
Mar 8, 2026

Conversation

@guoqingbao
Owner

No description provided.

@guoqingbao guoqingbao merged commit d312088 into main Mar 8, 2026
1 check passed
sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Mar 9, 2026
guoqingbao added a commit that referenced this pull request Mar 16, 2026
* Implement Constrained Generation via LLGuidance

This implements the full llguidance integration enabling
grammar-constrained inference for structured outputs, tool calling,
and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState
 → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure
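
The inline correction step can be pictured with a small sketch. This is not the project's code: `apply_allow_mask` and `argmax` are hypothetical helpers, and the boolean slice stands in for whatever mask the Matcher actually produces. The assumption is that disallowed tokens are pushed to negative infinity before a token is picked.

```rust
/// Hypothetical helper: apply a grammar allow-mask to raw logits in place.
/// `allowed[i] == true` means token `i` is permitted at this step.
fn apply_allow_mask(logits: &mut [f32], allowed: &[bool]) {
    assert_eq!(logits.len(), allowed.len(), "mask must cover the vocabulary");
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            // A disallowed token can never win sampling after this.
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick over masked logits; returns None when every token is masked.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v == f32::NEG_INFINITY {
            continue; // skip masked tokens
        }
        match best {
            Some((_, b)) if b >= v => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}
```

With logits `[2.0, 5.0, 1.0, 4.0]` and token 1 disallowed, the greedy pick moves from token 1 to token 3. A `None` result signals that the grammar left nothing to emit, which is the kind of case the rollback path would have to handle.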

Key components:
- params.grammar field in SamplingParams for RPC serialization
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite
loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via
build_tool_call_lark_grammar()
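
A minimal sketch of how a bounded rollback counter might behave, assuming the limit of 3 named above. `RollbackCounter` and `RollbackDecision` are illustrative names, not types from the codebase, and the real GuidanceState may track attempts differently.

```rust
/// Limit taken from the commit message; everything else here is illustrative.
const MAX_ROLLBACK_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum RollbackDecision {
    /// Roll back and retry generation; `attempt` counts from 1.
    Retry { attempt: u32 },
    /// The limit was reached; stop retrying to avoid an infinite loop.
    GiveUp,
}

#[derive(Default)]
struct RollbackCounter {
    attempts: u32,
}

impl RollbackCounter {
    /// Called each time post-process validation fails.
    fn on_validation_failure(&mut self) -> RollbackDecision {
        if self.attempts >= MAX_ROLLBACK_ATTEMPTS {
            RollbackDecision::GiveUp
        } else {
            self.attempts += 1;
            RollbackDecision::Retry { attempt: self.attempts }
        }
    }

    /// Clear the counter, for example when a sequence's state is reset.
    fn reset(&mut self) {
        self.attempts = 0;
    }
}
```

Under these assumptions, three consecutive failures yield `Retry` and the fourth yields `GiveUp`, so a grammar the model cannot satisfy ends the loop instead of spinning.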

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided
structured_outputs/response_format

* Support Qwen3.5 Dense models on Metal (#258)

* Utilize SpecialTokens Idiomatic Accessor for EOS

Expand SpecialTokens usage to cover EOS handling across the codebase,
including the chat template. This gates access to the EOS tokens
through a single common API, providing an interdiction point to add
or remove them per model or model family as required.

* Idiomatic SpecialTokens Access Pattern

- Replace manual EOS token extraction logic with centralized
SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic
in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token
IDs
- Remove stop_token_ids from SamplingParams and related logic
(now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with
Option<String>
- Add comprehensive SpecialTokens API with category-based
accessors, ID/string sets, and search methods

* More SpecialTokens, Improve Example/Binary

Improve the binary example to be a handy extractor for models which
developers can use to update special_tokens.rs quickly.

Add tags extracted from Qwen3.5 0.8B

* SpecialTokens Strings for Llama4 and Qwen3.5 MoE

Narrow the Common category search so it finds only string duplicates
of actual special tokens (handles "aftermarket" models/merges).

Add and test Llama4 and Qwen3.5 MoE

* ToolConfig Population w/ SpecialTokens

* Drop ToolFormat

* Lead The Horse to Water, Make Him <Think>

This PR introduces a new `reasoning_effort` parameter to control
reasoning block generation in the chat completion API, matching OpenAI's
reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to
`ChatCompletionRequest` accepting "none", "low", "medium", or "high"
values (case-insensitive)

- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar
construction
  - `thinking_grammar_with_reasoning_block()` generating
Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks
with base grammars

- **Integration**: Updated `compose_grammars()` in
`src/utils/guidance.rs` to accept and apply reasoning effort levels

- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()`
in `src/tools/schema.rs` to strip null from required field types,
ensuring grammars enforce field presence for tool parameters

- **Special Token Helpers**: Added `reasoning_start_ids()`,
`reasoning_end_ids()`, and `reasoning_tokens()` methods to
`SpecialTokens` for robust token detection

- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning
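
The case-insensitive parsing described above can be sketched with the standard `FromStr` trait. The enum name matches the one listed, but the variant names, the error type, and the trimming of whitespace are assumptions made for illustration rather than a copy of `src/utils/reasoning.rs`.

```rust
use std::str::FromStr;

/// Illustrative stand-in for the `ReasoningEffort` enum named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    type Err = String;

    /// Accepts "none", "low", "medium", or "high" in any letter case.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}
```

With this sketch, `"HIGH".parse::<ReasoningEffort>()` and `"high".parse::<ReasoningEffort>()` both succeed, while an unrecognized value such as `"extreme"` returns an error the request handler could surface to the client.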

* Tier Reasoning Effort

* Anchor XML Tool-Grammar With SpecialTokens Pads

This change updates the ToolGrammarBuilder to correctly use pad
token IDs for XML tool call termination when building Lark grammars
for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for </function> and
</parameter> tags. When the tokenizer lacks special closing
tags, the model can keep generating indefinitely, since XML is not a
finite, stateless grammar; see guidance-ai/llguidance/issues/306.

Use pad tokens as "magic" terminating markers embedded into the
grammar. The tokenizer/llg mask recognizes them as output that a
model cannot normally emit in its text (they are normally masked
to zero probability). Anchor the XML function/parameter generation
the same way tool-call and text are bounded. This is easier on the
model than forcing JSON parsing (qwen3), especially in conjunction
with forcing it to `<think>` if it's not trained to do so.

Mechanically, we modify the chat template to add special pad tags
after the closing tags for function and param, and inject those
into the grammar template submitted to the model as tool-choice.

Call path:
  src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
    ├─ Uses pad_ids[0] as </function> anchor
    └─ Uses pad_ids[1] as </parameter> anchor

Grammar structure:
-  start: ( text | tool_call )+ eos?
-  tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
-  tool_0: "<function=fetch_url_via_curl>" param_0_0 ...
"</function>" <[pad_id_0]>
-  param_0_0: "<parameter=url>" value_0_0 ... "</parameter>"
<[pad_id_1]>
- ...

The pad tokens serve as finite termination points for the XML
parser, allowing the Lark grammar to generate valid, parseable tool
calls without requiring explicit special closing tags in the
tokenizer vocabulary.
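
As a rough sketch of the grammar structure above, the following renders the `tool_N` and `param_N_M` rules with the two pad IDs as anchors. `Tool` and `render_tool_rules` are hypothetical names, not the `build_xml_with_anchors` implementation, and the `value_N_M` rules are referenced without being defined here.

```rust
/// Hypothetical description of one tool: its name and parameter names.
struct Tool<'a> {
    name: &'a str,
    params: &'a [&'a str],
}

/// Render anchored tool/parameter rules in the notation shown above.
/// `pad_ids[0]` closes each function, `pad_ids[1]` closes each parameter.
fn render_tool_rules(tools: &[Tool<'_>], pad_ids: [u32; 2]) -> String {
    let mut out = String::new();
    for (t, tool) in tools.iter().enumerate() {
        // References to this tool's parameter rules, in declaration order.
        let param_refs: Vec<String> = (0..tool.params.len())
            .map(|p| format!("param_{t}_{p}"))
            .collect();
        out.push_str(&format!(
            "tool_{t}: \"<function={}>\" {} \"</function>\" <[{}]>\n",
            tool.name,
            param_refs.join(" "),
            pad_ids[0]
        ));
        for (p, param) in tool.params.iter().enumerate() {
            out.push_str(&format!(
                "param_{t}_{p}: \"<parameter={param}>\" value_{t}_{p} \"</parameter>\" <[{}]>\n",
                pad_ids[1]
            ));
        }
    }
    out
}
```

For a single tool `fetch_url_via_curl` with one parameter `url` and made-up pad IDs 7 and 9, this emits a `tool_0` rule ending in `<[7]>` and a `param_0_0` rule ending in `<[9]>`, so each closing tag is followed by a token the grammar treats as a hard stop.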

* Cargo fmt

* Refactor guided decoding

* Update docs

* Typo fix

* Remove tool grammar & fix slow first token response for sync request

* Strip guided-decoding’s leftover tool grammar surface

* Fix incorrect guidance application

* Revert changes for scheduler.rs (tool call related)

* Apply per-sequence guided decoding

* Remove redundancy

* Fix corner case

* Permit empty tool call result

---------

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Co-authored-by: Guoqing Bao <topon@outlook.com>
guoqingbao added a commit that referenced this pull request May 21, 2026