Support Qwen3.5 Dense models on Metal #258
Merged
4cfc2e4 to
528c421
sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request on Mar 9, 2026
sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request on Mar 9, 2026
guoqingbao added a commit that referenced this pull request on Mar 16, 2026
* Implement Constrained Generation via LLGuidance

  This implements the full llguidance integration, enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

  Architecture:
  - TopLevelGrammar serialized via rmp_serde across RPC boundaries
  - Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
  - Inline correction via logits masking during sampling
  - Post-process correction via rollback on validation failure

  Key components:
  - params.grammar field in SamplingParams for RPC serialization

  GuidanceState:
  - GuidanceState::new() with Matcher state management
  - GuidanceState::reset() for proper state cleanup
  - Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
  - guidance_failed/guidance_mismatch sets cleared on rollback
  - Vocab size validation in build_llg_factory()
  - Lark grammar generation from tools via build_tool_call_lark_grammar()

  CLI flags:
  - --enable-tool-grammar: Auto-build LLG grammar from MCP tools
  - --allow-constraint-api: Accept client-provided structured_outputs/response_format

* Support Qwen3.5 Dense models on Metal (#258)

* Utilize SpecialTokens Idiomatic Accessor for EOS

  Expand SpecialTokens usage to cover EOS uses across the codebase, including the chat template. This gates access to the EOS tokens through a single common API, providing an interdiction point to add or remove them as needed per model or family as required.

* Idiomatic SpecialTokens Access Pattern

  - Replace manual EOS token extraction logic with centralized SpecialTokens::new() and idiomatic accessors
  - Eliminate EosTokenId enum and related complex serialization logic in favor of direct Vec<u32>
  - Update all callers to use SpecialTokens for tool start/end token IDs
  - Remove stop_token_ids from SamplingParams and related logic (now handled via SpecialTokens)
  - Simplify tokenizer config by replacing EosTokenEntry with Option<String>
  - Add comprehensive SpecialTokens API with category-based accessors, ID/string sets, and search methods

* More SpecialTokens, Improve Example/Binary

  Improve the binary example to be a handy extractor for models, which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B.

* SpecialTokens Strings for Llama4 and Qwen3.5 MoE

  Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE.

* ToolConfig Population w/ SpecialTokens

* Drop ToolFormat

* Lead The Horse to Water, Make Him <Think>

  This PR introduces a new `reasoning_effort` parameter to control reasoning block generation in the chat completion API, matching OpenAI's reasoning API behavior.

  - **API Extension**: Added `reasoning_effort` field to `ChatCompletionRequest` accepting "none", "low", "medium", or "high" values (case-insensitive)
  - **New Module**: Created `src/utils/reasoning.rs` with:
    - `ReasoningEffort` enum with `from_str` deserialization
    - `ThinkingGrammarBuilder` for reasoning block grammar construction
    - `thinking_grammar_with_reasoning_block()` generating Lark grammar patterns
    - `build_reasoning_grammar()` for composing reasoning blocks with base grammars
  - **Integration**: Updated `compose_grammars()` in `src/utils/guidance.rs` to accept and apply reasoning effort levels
  - **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()` in `src/tools/schema.rs` to strip null from required field types, ensuring grammars enforce field presence for tool parameters
  - **Special Token Helpers**: Added `reasoning_start_ids()`, `reasoning_end_ids()`, and `reasoning_tokens()` methods to `SpecialTokens` for robust token detection
  - **Comprehensive Tests**: Added 11 new tests covering:
    - Reasoning effort parsing and validation
    - Thinking grammar builder functionality
    - Schema null-stripping for required/optional fields
    - Grammar composition permutations with reasoning

* Tier Reasoning Effort

* Anchor XML Tool-Grammar With SpecialTokens Pads

  This change updates the ToolGrammarBuilder to correctly use pad token IDs for XML tool call termination when building Lark grammars for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

  The XML format requires closing markers for </function> and </parameter> tags. When the tokenizer lacks special closing tags, the model can run on generating forever, as XML is not a finite stateless grammar; see guidance-ai/llguidance/issues/306. Use pad tokens as "magic" terminating markers embedded into the grammar and recognizable by the tokenizer/llg mask as output a model cannot normally emit in its textual output (masked to 0.0 logprob normally).

  Anchor the XML function/parameter generation like we bound tool-call and text. This is easier on the model than forcing JSON parsing (qwen3), especially in conjunction with forcing it to `<think>` if it's not trained to do so. Mechanically, we modify the chat template to add special pad tags after the closing tags for function and param, and inject those into the grammar template submitted to the model as tool-choice.

  Call path:
  src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
  ├─ Uses pad_ids[0] as </function> anchor
  └─ Uses pad_ids[1] as </parameter> anchor

  Grammar structure:
  - start: ( text | tool_call )+ eos?
  - tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
  - tool_0: "<function=fetch_url_via_curl>" param_0_0 ... "</function>" <[pad_id_0]>
  - param_0_0: "<parameter=url>" value_0_0 ... "</parameter>" <[pad_id_1]>
  - ...

  The pad tokens serve as finite termination points for the XML parser, allowing the Lark grammar to generate valid, parseable tool calls without requiring explicit special closing tags in the tokenizer vocabulary.

* Cargo fmt
* Refactor guided decoding
* Update docs
* Typo fix
* Remove tool grammar & fix slow first token response for sync request
* Strip guided-decoding’s leftover tool grammar surface
* Fix incorrect guidance application
* Revert changes for scheduler.rs (tool call related)
* Apply per-sequence guided decoding
* Remove redundancy
* Fix corner case
* Permit empty tool call result

---------

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Co-authored-by: Guoqing Bao <topon@outlook.com>
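For readers skimming the commit message, the `reasoning_effort` item is the easiest piece to picture in code. The following is a minimal sketch of case-insensitive parsing for the four accepted values ("none", "low", "medium", "high"). It is not the contents of `src/utils/reasoning.rs`: the variant names, the `String` error type, and the whitespace trimming are assumptions made for illustration, and only the enum name and the accepted values come from the commit message.

```rust
use std::str::FromStr;

/// Illustrative stand-in for the `ReasoningEffort` enum named in the commit
/// message. Variant names and derives are assumptions, not the real source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    // Assumption: a plain String error; the real module may use its own type.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercase before matching so "LOW", "Low" and "low" are equivalent,
        // which is the case-insensitive behavior the commit message describes.
        // Trimming surrounding whitespace is an extra assumption of this sketch.
        match s.trim().to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unsupported reasoning_effort: {other:?}")),
        }
    }
}

fn main() {
    // Exercise accepted spellings plus one value that should be rejected.
    for raw in ["none", "LOW", "Medium", "high", "extreme"] {
        match raw.parse::<ReasoningEffort>() {
            Ok(level) => println!("{raw} -> {level:?}"),
            Err(err) => println!("{raw} -> error: {err}"),
        }
    }
}
```

Implementing `FromStr` lets callers write `raw.parse::<ReasoningEffort>()`, so an unknown value surfaces as an error at the request boundary instead of silently falling back to a default level.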
guoqingbao added a commit that referenced this pull request on May 21, 2026
guoqingbao added a commit that referenced this pull request on May 21, 2026
No description provided.