guoqingbao · sempervictus · Apr 6, 2026 · May 30, 2026 · May 31, 2026 · May 31, 2026
diff --git a/Cargo.toml b/Cargo.toml
@@ -22,8 +22,8 @@ itertools = "0.13.0"
 akin = "0.4.0"
 indicatif = "0.17.11"
 serde_json = "1.0.108"
-llguidance = { version = "1.6", default-features = false, features = ["lark"] }
-toktrie_hf_tokenizers = "1.6"
+llguidance = { version = "1.7", default-features = false, features = ["lark", "referencing", "jsonschema_validation"] }
+toktrie_hf_tokenizers = "1.7"
 toktrie = "1.4"
 half = { version = "2.5.0", features = ["num-traits", "use-intrinsics", "rand_distr"] }
 tokio = { version = "1.38.0", features = ["sync"] }

diff --git a/ReadMe-CN.md b/ReadMe-CN.md
@@ -177,6 +177,147 @@ xinfer --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q4_K_M.gguf
 ## 📘 使用方法
 > **Python包安装后**请使用 `python3 -m xinfer.server` 方式运行
 
+### 安装
+
+<details>
+<summary><b>CUDA（Linux）</b></summary>
+
+```bash
+# 前置依赖
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+sudo apt-get install -y git build-essential libssl-dev pkg-config
+
+# 可选：CUDA toolkit + NCCL
+sudo apt-get install -y cuda-nvcc-12-9 cuda-nvrtc-dev-12-9 libcublas-dev-12-9 libcurand-dev-12-9
+sudo apt-get install -y libnccl2 libnccl-dev
+
+# 编译安装
+cargo install --features cuda,nccl,flashinfer,cutlass
+# Flash Attention 后端：
+cargo install --features cuda,nccl,flashattn,cutlass
+# V100 / 较老硬件（无 flash 后端）：
+cargo install --features cuda,nccl
+```
+
+</details>
+
+<details>
+<summary><b>Metal（macOS）</b></summary>
+
+```bash
+# 先安装 Xcode 命令行工具
+cargo install --features metal
+```
+
+</details>
+
+默认启动 **API 服务模式**（端口 8000）。使用 `--i` 启用交互模式 🤖，`--ui-server` 启用带 Web UI 的服务模式 🌐，`--m` 指定Huggingface模型，或`--w` 指定本地Safetensors模型路径 或`--f` 指定GGUF模型文件：
+
+> 单卡/多卡推理
+  <details open>
+    <summary>单卡推理</summary>
+
+   ```bash
+   # CUDA （将 `--i`替换成 `--ui-server`则启用网页版本）
+   vllm-rs --i --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf --kv-fraction 0.8
+   # Metal/MacOS (MacOS Tahoe之前的系统可能会存在生成过慢问题，使用更小的`--max-model-len` 或 `--kv-fraction`减少显存占用)
+   vllm-rs --i --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q3_K_M.gguf
+   ```
+  </details>
+
+  <details open>
+    <summary>多卡未量化模型</summary>
+
+   ```bash
+   vllm-rs --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --ui-server --prefix-cache
+   ```
+  </details>
+
+  <details open>
+    <summary>FP8/FP4模型</summary>
+
+  _FP8格式:_
+   ```bash
+   vllm-rs --d 0,1 --w /path/Qwen3-Coder-30B-A3B-Instruct-FP8/ --ui-server --prefix-cache
+    # Or Qwen3-Next 80B
+   vllm-rs --m Qwen/Qwen3-Coder-Next-FP8 --ui-server --d 0,1 --prefix-cache
+   ```
+
+  _MXFP4格式:_
+  ```bash
+  vllm-rs --m olka-fi/Qwen3.5-4B-MXFP4 --ui-server --prefix-cache
+  ```
+
+  _NVFP4格式:_
+  ```bash
+  vllm-rs --m AxionML/Qwen3.5-9B-NVFP4 --ui-server --prefix-cache
+  ```
+  </details>
+
+   <details open>
+    <summary>多卡量化模型</summary>
+
+   ```bash
+   vllm-rs --ui-server --d 0,1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --prefix-cache
+   ```
+  </details>
+
+   <details open>
+    <summary>未量化模型运行为Q4K量化模型，同时使用FP8 KVCache</summary>
+
+   ```bash
+   # 编译时去除`flashinfer` 或 `flashattn` 以使用fp8 kvcache
+   vllm-rs --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --server --port 8000 --fp8-kvcache
+   ```
+  </details>
+
+---
+
+## 🔌 结构化输出与约束（Guided Decoding）
+
+vLLM.rs 现在支持通过 llguidance 库实现结构化输出和约束生成。
+
+### ⚠️ 安全说明
+
+**客户端提供的约束默认被阻止。**要启用它们，您必须显式设置 `--allow-constraint-api` 标志。
+
+#### 启用客户端约束
+```bash
+# 通过 HTTP API 启用客户端提交的约束
+vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --allow-constraint-api
+```
+
+#### 客户端约束的安全风险
+客户端提供的约束可能导致严重的安全漏洞：
+
+1. **Lark 语法注入**：恶意客户端可以提交精心设计的 Lark 语法，这些语法：
+   - 可以访问超出用户角色边界的特殊令牌
+   - 注入可能导致 ReDoS 攻击的任意正则表达式模式
+   - 绕过聊天模板的角色分离
+
+2. **JSON Schema 转义**：客户端可以指定：
+   - 引用系统不打算让用户控制的内部特殊令牌
+   - 创建模糊的令牌边界，导致系统指令泄露
+   - 注入匹配系统角色的禁止正则表达式模式
+
+3. **角色边界违规**：启用约束后，客户端可能：
+   - 逃逸聊天模板中的 `user:` 角色边界
+   - 注入 `system:` 或 `assistant:` 角色内容
+   - 操纵 tool_call 标记以注入伪造的工具响应
+   - 发明新的方法使设计不佳的系统超出预期范围运行
+
+#### 推荐用法
+- **生产环境**：仅在访问方系统/客户端可信，或传入内容经由可内联校验语法的 tokenizer-aware WAF 过滤时，才设置 `--enable-tool-grammar` 和/或 `--allow-constraint-api`。
+
+```bash
+# 启用自动工具语法生成
+vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --enable-tool-grammar
+```
+
+查看 [**结构化输出文档 →**](docs/llguidance-integration.md)
+
+---
+
 > Docker 内构建请参考 [**在 Docker 中运行 xInfer →**](docs/docker.md)
 
 ### 运行模型

diff --git a/ReadMe.md b/ReadMe.md
@@ -290,6 +290,48 @@ xinfer --m mistralai/Ministral-3-3B --ui-server
 
 ---
 
+## 🔌 Guided decoding (Structured Outputs & Constraints)
+vLLM.rs now supports structured output and constraint-based generation via llguidance.
+
+### ⚠️ Security Notice
+
+**Client-provided constraints are BLOCKED by default.** To enable them, you must explicitly set the `--allow-constraint-api` flag.
+
+#### Enabling Client Constraints
+```bash
+# Enable client-submitted constraints via HTTP API
+vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --allow-constraint-api
+```
+
+#### Security Risks of Client Constraints
+Client-provided constraints can introduce serious security vulnerabilities:
+
+1. **Lark Grammar Injection**: Malicious clients can submit crafted Lark grammars that:
+   - Access special tokens beyond the user role boundary
+   - Inject arbitrary regex patterns that could cause ReDoS attacks
+   - Bypass the chat template's role separation
+
+2. **JSON Schema Escapes**: Clients can specify schemas that:
+   - Reference internal special tokens not intended for user control
+   - Create ambiguous token boundaries that leak system instructions
+   - Inject forbidden regex patterns matching system roles
+
+3. **Role Boundary Violations**: When constraints are enabled, clients can potentially:
+   - Escape the `user:` role boundary in chat templates
+   - Inject `system:` or `assistant:` role content
+   - Manipulate tool_call markers to inject fake tool responses
+   - Invent new ways to make poorly designed systems behave beyond intended scope
+
+#### Recommended Usage
+- **Production**: Set `--enable-tool-grammar` and/or `--allow-constraint-api` only when the accessing systems/clients are trusted, or when inbound content is filtered through a tokenizer-aware WAF that validates grammars inline.
+
+```bash
+# Enable automatic tool grammar generation
+vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --enable-tool-grammar
+```
+
+See [**Structured Outputs Documentation →**](docs/llguidance-integration.md)
+
 ## 📘 Build from source code
 
 **Option 1 — Cargo**
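
Note on the Security Notice above: the default-deny behavior it describes can be sketched as a small gate in front of request handling. This is a minimal illustration, not the code in this PR. `ClientConstraint`, `ConstraintError`, and `admit_constraint` are hypothetical names, and returning an error (rather than silently dropping the constraint) is an assumption about what "blocked" means.

```rust
/// Client-supplied constraint kinds, loosely mirroring the documented
/// `constraint_type` values. Hypothetical type, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum ClientConstraint {
    Regex(String),
    Lark(String),
    JsonSchema(String),
}

/// Why a constraint was refused. Hypothetical type.
#[derive(Debug, PartialEq)]
pub enum ConstraintError {
    /// The server was started without `--allow-constraint-api`.
    Blocked,
}

/// Default-deny gate: a client constraint passes through only when the
/// operator opted in. Requests without a constraint are never affected.
pub fn admit_constraint(
    allow_constraint_api: bool,
    requested: Option<ClientConstraint>,
) -> Result<Option<ClientConstraint>, ConstraintError> {
    match requested {
        // Nothing requested: the request runs unconstrained.
        None => Ok(None),
        // Constraint requested and the operator opted in: accept it.
        Some(c) if allow_constraint_api => Ok(Some(c)),
        // Constraint requested but the flag is off: refuse (assumed behavior).
        Some(_) => Err(ConstraintError::Blocked),
    }
}
```

With the flag off, `admit_constraint(false, Some(ClientConstraint::Regex("a+".into())))` yields `Err(ConstraintError::Blocked)`, while a request that carries no constraint is admitted unchanged.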

diff --git a/docs/guided_decoding.md b/docs/guided_decoding.md
@@ -8,13 +8,23 @@ It focuses on:
 - how reasoning effort is applied
 - practical usage and validation commands
 
+## JSON Schema Reference
+
+For detailed JSON Schema constraint documentation with curl examples, see [`llguidance-json-schema.md`](llguidance-json-schema.md).
+
+This covers:
+- Schema type definitions (string, integer, number, boolean, object, array)
+- All supported API endpoints (OpenAI-compatible and Claude server)
+- Complete curl examples for each permutation
+- Schema sanitization behavior
+
 ## Current Model
 
 Guided decoding is request-scoped.
 
 The core engine does not invent grammars on its own. A request either:
 - supplies a constraint grammar
-- gets that constraint grammar composed with a reasoning prefix
+- gets a composed grammar containing both
 - or runs unconstrained when neither exists
 
 The final grammar is stored in `SamplingParams.grammar` and consumed by the runner.
@@ -44,7 +54,7 @@ The server composes:
 
 The result is a single `TopLevelGrammar` assigned to `SamplingParams.grammar`.
 
-If no client-supplied constraint grammar exists, `params.grammar` stays `None`.
+If no constraint grammar and no tool grammar exist, `params.grammar` stays `None`.
 
 ### 3. Sampling in runner
 
@@ -84,22 +94,25 @@ Legacy fields
 - `constraint`
 - `constraint_type = regex | lark | json_schema | json`
 
-If a request provides none of the above, guided decoding is not enabled.
+If a request provides none of the above, guided decoding is not enabled unless tool grammar synthesis adds one.
+
+The grammar composition logic lives in `src/utils/guidance_grammar.rs`, where `GrammarRequestDispatcher` and `GrammarComposer` compose constraint grammars with reasoning grammars.
 
 ### Claude server
 
+Claude reuses the same tool-grammar builder path.
+
 Current state:
 - Claude does not expose the same client-supplied grammar request surface as the OpenAI endpoint
 - Claude reasoning is still driven by explicit thinking behavior, not by `reasoning_effort` grammar composition
-- Claude requests therefore do not currently enable guided decoding
 
 ## Reasoning Effort
 
 Reasoning effort is separate from ordinary structured outputs.
 
 ### Current state
 
-The OpenAI path maps `reasoning_effort` into grammar composition when a request constraint exists.
+The OpenAI path maps `reasoning_effort` into grammar composition.
 
 Accepted values come from `ReasoningEffort::from_str`:
 - `none`
@@ -115,7 +128,7 @@ Non-Python builds also support:
 - `custom:<template>`
 
 Relevant code:
-- `src/utils/reasoning.rs`
+- `src/utils/guidance_grammar.rs`
 - `src/server/server.rs`
 - `src/utils/guidance.rs`
 
@@ -124,7 +137,6 @@ Relevant code:
 Reasoning effort:
 - does not enable chat-template thinking by itself
 - only affects grammar composition
-- is ignored when no request constraint grammar is present
 - only works when reasoning start/end tokens are available
 
 If the tokenizer does not expose reasoning markers, the system logs a warning and falls back to the base grammar.
@@ -160,7 +172,7 @@ xinfer --m Qwen/Qwen3.5-35B-A3B-FP8/ --ui-server --d 0
 | Enforce text pattern | `structured_outputs` or `constraint` | `regex` |
 | Enforce full object schema | `structured_outputs` or `response_format` | `json` / `json_schema` |
 | Enforce custom grammar | `structured_outputs` or `constraint` | `grammar` / `lark` |
-| Constrain tagged structured output payload | `structured_outputs` | `structural_tag` |
+| Constrain tool call payload | `structured_outputs` or automatic tool grammar | `structural_tag` / tool grammar |
 | Add a reasoning prefix | `reasoning_effort` | `low`, `medium`, `high`, etc. |
 
 ### 1. Constrain the answer to a fixed set
@@ -470,5 +482,5 @@ Check:
 
 - Guided decoding is only active when `SamplingParams.grammar` is present.
 - OpenAI currently has the richest client-facing grammar surface.
-- Claude does not currently expose request-level guided decoding.
+- Claude currently reuses tool grammar, but not the same direct constraint request API.
 - No request-level grammar means no guided decoding.
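
Note on the composition rules in `docs/guided_decoding.md`: read literally, they amount to a small decision function. The sketch below is an illustration under stated assumptions, not the `GrammarComposer` from this PR. `Part`, `ReasoningMarkers`, and `compose_parts` are hypothetical names, the final grammar is modeled as an ordered list of parts rather than a `TopLevelGrammar`, and returning `None` for a reasoning-only request follows the "stays `None`" sentence literally; the real behavior may differ.

```rust
/// One building block of the final grammar (hypothetical representation).
#[derive(Debug, Clone, PartialEq)]
pub enum Part {
    Reasoning { start: String, end: String },
    Constraint(String),
    Tool(String),
}

/// Reasoning start/end markers exposed by the tokenizer, if any.
pub struct ReasoningMarkers {
    pub start: String,
    pub end: String,
}

/// Decide what ends up in `SamplingParams.grammar`, as an ordered part list.
pub fn compose_parts(
    constraint: Option<&str>,
    tool: Option<&str>,
    reasoning_requested: bool,
    markers: Option<&ReasoningMarkers>,
) -> Option<Vec<Part>> {
    // Base grammar: whichever of constraint and tool grammar are present.
    let mut base: Vec<Part> = Vec::new();
    if let Some(c) = constraint {
        base.push(Part::Constraint(c.to_string()));
    }
    if let Some(t) = tool {
        base.push(Part::Tool(t.to_string()));
    }
    // No constraint grammar and no tool grammar: the grammar stays `None`.
    if base.is_empty() {
        return None;
    }
    if !reasoning_requested {
        return Some(base);
    }
    match markers {
        // Markers available: place the reasoning block in front of the base.
        Some(m) => {
            let mut parts = vec![Part::Reasoning {
                start: m.start.clone(),
                end: m.end.clone(),
            }];
            parts.extend(base);
            Some(parts)
        }
        // No markers: warn and fall back to the base grammar.
        None => {
            eprintln!("warning: tokenizer exposes no reasoning markers; using base grammar");
            Some(base)
        }
    }
}
```

The fallback branch mirrors the documented behavior that a tokenizer without reasoning markers logs a warning and keeps the base grammar.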
diff --git a/src/api.rs b/src/api.rs
@@ -167,6 +167,8 @@ impl EngineBuilder {
             false,
             false,
             None,
+            false,
+            false,
         );
 
         if let Some(kv_dtype) = self.kvcache_dtype {