Support Qwen3.5 Dense models on Metal by guoqingbao · Pull Request #258 · guoqingbao/xinfer

guoqingbao · 2026-03-07T12:53:17Z

No description provided.

* Implement Constrained Generation via LLGuidance

  This implements the full llguidance integration, enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

  Architecture:
  - TopLevelGrammar serialized via rmp_serde across RPC boundaries
  - Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
  - Inline correction via logits masking during sampling
  - Post-process correction via rollback on validation failure

  Key components:
  - params.grammar field in SamplingParams for RPC serialization GuidanceState
  - GuidanceState::new() with Matcher state management
  - GuidanceState::reset() for proper state cleanup
  - Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
  - guidance_failed/guidance_mismatch sets cleared on rollback
  - Vocab size validation in build_llg_factory()
  - Lark grammar generation from tools via build_tool_call_lark_grammar()

  CLI flags:
  - --enable-tool-grammar: Auto-build LLG grammar from MCP tools
  - --allow-constraint-api: Accept client-provided structured_outputs/response_format

* Support Qwen3.5 Dense models on Metal (#258)

* Utilize SpecialTokens Idiomatic Accessor for EOS

  Expand SpecialTokens usage to cover EOS uses across the codebase, including the chat template. This gates access to the EOS tokens through a single common API, providing an interdiction point to add or remove them per model or family as required.
* Idiomatic SpecialTokens Access Pattern
  - Replace manual EOS token extraction logic with centralized SpecialTokens::new() and idiomatic accessors
  - Eliminate EosTokenId enum and related complex serialization logic in favor of direct Vec<u32>
  - Update all callers to use SpecialTokens for tool start/end token IDs
  - Remove stop_token_ids from SamplingParams and related logic (now handled via SpecialTokens)
  - Simplify tokenizer config by replacing EosTokenEntry with Option<String>
  - Add comprehensive SpecialTokens API with category-based accessors, ID/string sets, and search methods

* More SpecialTokens, Improve Example/Binary

  Improve the binary example to be a handy extractor for models, which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B.

* SpecialTokens Strings for Llama4 and Qwen3.5 MoE

  Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE.

* ToolConfig Population w/ SpecialTokens

* Drop ToolFormat

* Lead The Horse to Water, Make Him <Think>

  This PR introduces a new `reasoning_effort` parameter to control reasoning block generation in the chat completion API, matching OpenAI's reasoning API behavior.
  - **API Extension**: Added `reasoning_effort` field to `ChatCompletionRequest` accepting "none", "low", "medium", or "high" values (case-insensitive)
  - **New Module**: Created `src/utils/reasoning.rs` with:
    - `ReasoningEffort` enum with `from_str` deserialization
    - `ThinkingGrammarBuilder` for reasoning block grammar construction
    - `thinking_grammar_with_reasoning_block()` generating Lark grammar patterns
    - `build_reasoning_grammar()` for composing reasoning blocks with base grammars
  - **Integration**: Updated `compose_grammars()` in `src/utils/guidance.rs` to accept and apply reasoning effort levels
  - **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()` in `src/tools/schema.rs` to strip null from required field types, ensuring grammars enforce field presence for tool parameters
  - **Special Token Helpers**: Added `reasoning_start_ids()`, `reasoning_end_ids()`, and `reasoning_tokens()` methods to `SpecialTokens` for robust token detection
  - **Comprehensive Tests**: Added 11 new tests covering:
    - Reasoning effort parsing and validation
    - Thinking grammar builder functionality
    - Schema null-stripping for required/optional fields
    - Grammar composition permutations with reasoning

* Tier Reasoning Effort

* Anchor XML Tool-Grammar With SpecialTokens Pads

  This change updates the ToolGrammarBuilder to correctly use pad token IDs for XML tool call termination when building Lark grammars for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5). The XML format requires closing markers for `</function>` and `</parameter>` tags. When the tokenizer lacks special closing tags, the model can run on generating forever, as XML is not a finite stateless grammar; see guidance-ai/llguidance/issues/306. Use pad tokens as "magic" terminating markers embedded into the grammar and recognizable by the tokenizer/llg mask as output a model cannot normally emit in its textual output (normally masked to 0.0 logprob).
  Anchor the XML function/parameter generation like we bound tool-call and text. This is easier on the model than forcing JSON parsing (qwen3), especially in conjunction with forcing it to `<think>` if it's not trained to do so. Mechanically, we modify the chat template to add special pad tags after the closing tags for function and param, and inject those into the grammar template submitted to the model as tool-choice.

  Call path:

  src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
  ├─ Uses pad_ids[0] as `</function>` anchor
  └─ Uses pad_ids[1] as `</parameter>` anchor

  Grammar structure:
  - start: ( text | tool_call )+ eos?
  - tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
  - tool_0: "<function=fetch_url_via_curl>" param_0_0 ... "</function>" <[pad_id_0]>
  - param_0_0: "<parameter=url>" value_0_0 ... "</parameter>" <[pad_id_1]>
  - ...

  The pad tokens serve as finite termination points for the XML parser, allowing the Lark grammar to generate valid, parseable tool calls without requiring explicit special closing tags in the tokenizer vocabulary.

* Cargo fmt
* Refactor guided decoding
* Update docs
* Typo fix
* Remove tool grammar & fix slow first token response for sync request
* Strip guided-decoding’s leftover tool grammar surface
* Fix incorrect guidance application
* Revert changes for scheduler.rs (tool call related)
* Apply per-sequence guided decoding
* Remove redundancy
* Fix corner case
* Permit empty tool call result

---------

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Co-authored-by: Guoqing Bao <topon@outlook.com>
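For readers unfamiliar with the `reasoning_effort` change described in the commit message, the case-insensitive parsing can be sketched roughly as follows. This is a minimal illustration, not the code in `src/utils/reasoning.rs`: the variant names, the `String` error type, and the error message are assumptions; only the four accepted values and the case-insensitivity come from the message.

```rust
use std::str::FromStr;

/// Reasoning effort levels named in the commit message.
/// Variant names are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    // Assumption: a plain String error; the real module may use its own error type.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercase first so "HIGH", "High" and "high" are all accepted.
        match s.to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}
```

With this sketch, `"High".parse::<ReasoningEffort>()` yields `Ok(ReasoningEffort::High)`, and any value outside the four listed ones is rejected with an error.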
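The pad-anchored grammar structure listed in the commit message can be illustrated with a small rule generator. This is only a sketch under stated assumptions: `ToolSpec` and `xml_tool_rules` are hypothetical names (the message names `build_xml_with_anchors(pad_ids)` but does not show its signature), the `value_N_M` rules are left undefined, and the parts the message elides with `...` are omitted rather than guessed.

```rust
/// Hypothetical description of one tool: its name and its parameter names.
pub struct ToolSpec<'a> {
    pub name: &'a str,
    pub params: &'a [&'a str],
}

/// Render the `tool_N` and `param_N_M` rules for one tool.
/// `pad_ids[0]` anchors `</function>`, `pad_ids[1]` anchors `</parameter>`,
/// following the call path described in the commit message.
pub fn xml_tool_rules(index: usize, tool: &ToolSpec, pad_ids: [u32; 2]) -> Vec<String> {
    // References to this tool's parameter rules, in declaration order.
    let param_refs: Vec<String> = (0..tool.params.len())
        .map(|j| format!("param_{index}_{j}"))
        .collect();

    // The function rule ends with the first pad token as a hard terminator.
    let mut rules = vec![format!(
        "tool_{index}: \"<function={}>\" {} \"</function>\" <[{}]>",
        tool.name,
        param_refs.join(" "),
        pad_ids[0]
    )];

    // Each parameter rule ends with the second pad token.
    for (j, p) in tool.params.iter().enumerate() {
        rules.push(format!(
            "param_{index}_{j}: \"<parameter={p}>\" value_{index}_{j} \"</parameter>\" <[{}]>",
            pad_ids[1]
        ));
    }
    rules
}
```

Calling `xml_tool_rules(0, &ToolSpec { name: "fetch_url_via_curl", params: &["url"] }, [7, 8])` (the IDs 7 and 8 are made up) produces one `tool_0` rule ending in `<[7]>` and one `param_0_0` rule ending in `<[8]>`, mirroring the shape shown in the message.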

Support Qwen3.5 Dense models on Metal

528c421

guoqingbao force-pushed the qwen3_5_metal branch from 4cfc2e4 to 528c421 Compare March 8, 2026 15:06

guoqingbao merged commit d312088 into main Mar 8, 2026
1 check passed

sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Mar 9, 2026

Support Qwen3.5 Dense models on Metal (guoqingbao#258)

c4d367b

sempervictus pushed a commit to sempervictus/vllm.rs that referenced this pull request Mar 9, 2026

Support Qwen3.5 Dense models on Metal (guoqingbao#258)

9a1ab79

guoqingbao added a commit that referenced this pull request May 21, 2026

Support Qwen3.5 Dense models on Metal (#258)

1e3099b

Support Qwen3.5 Dense models on Metal #258
guoqingbao merged 1 commit into main from qwen3_5_metal
