diff --git a/Cargo.toml b/Cargo.toml index 6a5c76cc..238fb8e1 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -22,8 +22,8 @@ itertools = "0.13.0" akin = "0.4.0" indicatif = "0.17.11" serde_json = "1.0.108" -llguidance = { version = "1.6", default-features = false, features = ["lark"] } -toktrie_hf_tokenizers = "1.6" +llguidance = { version = "1.7", default-features = false, features = ["lark", "referencing", "jsonschema_validation"] } +toktrie_hf_tokenizers = "1.7" toktrie = "1.4" half = { version = "2.5.0", features = ["num-traits", "use-intrinsics", "rand_distr"] } tokio = { version = "1.38.0", features = ["sync"] } diff --git a/ReadMe-CN.md b/ReadMe-CN.md index 1ef3356e..a83943d0 100644 --- a/ReadMe-CN.md +++ b/ReadMe-CN.md @@ -177,6 +177,147 @@ xinfer --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q4_K_M.gguf ## 📘 使用方法 > **Python包安装后**请使用 `python3 -m xinfer.server` 方式运行 +### 安装 + +
+CUDA(Linux) + +```bash +# 前置依赖 +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +sudo apt-get install -y git build-essential libssl-dev pkg-config + +# 可选:CUDA toolkit + NCCL +sudo apt-get install -y cuda-nvcc-12-9 cuda-nvrtc-dev-12-9 libcublas-dev-12-9 libcurand-dev-12-9 +sudo apt-get install -y libnccl2 libnccl-dev + +# 编译安装 +cargo install --features cuda,nccl,flashinfer,cutlass +# Flash Attention 后端: +cargo install --features cuda,nccl,flashattn,cutlass +# V100 / 较老硬件(无 flash 后端): +cargo install --features cuda,nccl +``` + +
+ +
+Metal(macOS) + +```bash +# 先安装 Xcode 命令行工具 +cargo install --features metal +``` + +
+ +默认启动 **API 服务模式**(端口 8000)。使用 `--i` 启用交互模式 🤖,`--ui-server` 启用带 Web UI 的服务模式 🌐,`--m` 指定Huggingface模型,或`--w` 指定本地Safetensors模型路径 或`--f` 指定GGUF模型文件: + +> 单卡/多卡推理 +
+ 单卡推理 + + ```bash + # CUDA (将 `--i`替换成 `--ui-server`则启用网页版本) + vllm-rs --i --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf --kv-fraction 0.8 + # Metal/MacOS (MacOS Tahoe之前的系统可能会存在生成过慢问题,使用更小的`--max-model-len` 或 `--kv-fraction`减少显存占用) + vllm-rs --i --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q3_K_M.gguf + ``` +
+ +
+ 多卡未量化模型 + + ```bash + vllm-rs --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --ui-server --prefix-cache + ``` +
+ +
+ FP8/FP4模型 + + _FP8格式:_ + ```bash + vllm-rs --d 0,1 --w /path/Qwen3-Coder-30B-A3B-Instruct-FP8/ --ui-server --prefix-cache + # Or Qwen3-Next 80B + vllm-rs --m Qwen/Qwen3-Coder-Next-FP8 --ui-server --d 0,1 --prefix-cache + ``` + + _MXFP4格式:_ + ```bash + vllm-rs --m olka-fi/Qwen3.5-4B-MXFP4 --ui-server --prefix-cache + ``` + + _NVFP4格式:_ + ```bash + vllm-rs --m AxionML/Qwen3.5-9B-NVFP4 --ui-server --prefix-cache + ``` +
+ +
+ 多卡量化模型 + + ```bash + vllm-rs --ui-server --d 0,1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --prefix-cache + ``` +
+ +
+ 未量化模型运行为Q4K量化模型,同时使用FP8 KVCache + + ```bash + # 编译时去除`flashinfer` 或 `flashattn` 以使用fp8 kvcache + vllm-rs --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --server --port 8000 --fp8-kvcache + ``` +
+ +--- + +## 🔌 结构化输出与约束(Guided Decoding) + +vLLM.rs 现在支持通过 llguidance 库实现结构化输出和约束生成。 + +### ⚠️ 安全说明 + +**客户端提供的约束默认被阻止。**要启用它们,您必须显式设置 `--allow-constraint-api` 标志。 + +#### 启用客户端约束 +```bash +# 启用客户端提交的约束 via HTTP API +vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --allow-constraint-api +``` + +#### 客户端约束的安全风险 +客户端提供的约束可能导致严重的安全漏洞: + +1. **Lark 语法注入**:恶意客户端可以提交精心设计的 Lark 语法,这些语法: + - 可以访问超出用户角色边界的特殊令牌 + - 注入可能导致 ReDoS 攻击的任意正则表达式模式 + - 绕过聊天模板的角色分离 + +2. **JSON Schema 转义**:客户端可以指定: + - 引用系统不打算让用户控制的内部特殊令牌 + - 创建模糊的令牌边界,导致系统指令泄露 + - 注入匹配系统角色的禁止正则表达式模式 + +3. **角色边界 violation**:启用约束后,客户端可能: + - 逃逸聊天模板中的 `user:` 角色边界 + - 注入 `system:` 或 `assistant:` 角色内容 + - 操纵 tool_call 标记以注入伪造的工具响应 + - 发明新的方法使设计不佳的系统超出预期范围运行 + +#### 推荐用法 +- **生产环境**:仅与可信的访问系统/客户端一起设置 `--enable-tool-grammar` 和/或 `--allow-constraint-api`,或在 tokenizer-aware WAF 内联验证语法时过滤传入内容。 + +```bash +# 启用自动工具语法生成 +vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --enable-tool-grammar +``` + +查看 [**结构化输出文档 →**](docs/llguidance-integration.md) + +--- + > Docker 内构建请参考 [**在 Docker 中运行 xInfer →**](docs/docker.md) ### 运行模型 diff --git a/ReadMe.md b/ReadMe.md index 171fcf7f..f4d43d7c 100644 --- a/ReadMe.md +++ b/ReadMe.md @@ -290,6 +290,48 @@ xinfer --m mistralai/Ministral-3-3B --ui-server --- +## 🔌 Guided decoding (Structured Outputs & Constraints) +vLLM.rs now supports structured output and constraint-based generation via llguidance. + +### ⚠️ Security Notice + +**Client-provided constraints are BLOCKED by default.** To enable them, you must explicitly set the `--allow-constraint-api` flag. + +#### Enabling Client Constraints +```bash +# Enable client-submitted constraints via HTTP API +vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --allow-constraint-api +``` + +#### Security Risks of Client Constraints +Client-provided constraints can introduce serious security vulnerabilities: + +1. 
**Lark Grammar Injection**: Malicious clients can submit crafted Lark grammars that: + - Access special tokens beyond the user role boundary + - Inject arbitrary regex patterns that could cause ReDoS attacks + - Bypass the chat template's role separation + +2. **JSON Schema Escapes**: Clients can specify schemas that: + - Reference internal special tokens not intended for user control + - Create ambiguous token boundaries that leak system instructions + - Inject forbidden regex patterns matching system roles + +3. **Role Boundary Violations**: When constraints are enabled, clients can potentially: + - Escape the `user:` role boundary in chat templates + - Inject `system:` or `assistant:` role content + - Manipulate tool_call markers to inject fake tool responses + - Invent new ways to make poorly designed systems behave beyond intended scope + +#### Recommended Usage +- **Production**: Set `--enable-tool-grammar` and/or `--allow-constraint-api` with trusted accessor systems/clients or when filtering inbound content through a tokenizer-aware WAF with grammar validation inline. + +```bash +# Enable automatic tool grammar generation +vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --enable-tool-grammar +``` + +See [**Structured Outputs Documentation →**](docs/llguidance-integration.md) + ## 📘 Build from source code **Option 1 — Cargo** diff --git a/docs/guided_decoding.md b/docs/guided_decoding.md index d177bf8b..68f3fa5d 100644 --- a/docs/guided_decoding.md +++ b/docs/guided_decoding.md @@ -8,13 +8,23 @@ It focuses on: - how reasoning effort is applied - practical usage and validation commands +## JSON Schema Reference + +For detailed JSON Schema constraint documentation with curl examples, see [`llguidance-json-schema.md`](llguidance-json-schema.md). 
+ +This covers: +- Schema type definitions (string, integer, number, boolean, object, array) +- All supported API endpoints (OpenAI-compatible and Claude server) +- Complete curl examples for each permutation +- Schema sanitization behavior + ## Current Model Guided decoding is request-scoped. The core engine does not invent grammars on its own. A request either: - supplies a constraint grammar -- gets that constraint grammar composed with a reasoning prefix +- gets a composed grammar containing both - or runs unconstrained when neither exists The final grammar is stored in `SamplingParams.grammar` and consumed by the runner. @@ -44,7 +54,7 @@ The server composes: The result is a single `TopLevelGrammar` assigned to `SamplingParams.grammar`. -If no client-supplied constraint grammar exists, `params.grammar` stays `None`. +If no constraint grammar and no tool grammar exist, `params.grammar` stays `None`. ### 3. Sampling in runner @@ -84,14 +94,17 @@ Legacy fields - `constraint` - `constraint_type = regex | lark | json_schema | json` -If a request provides none of the above, guided decoding is not enabled. +If a request provides none of the above, guided decoding is not enabled unless tool grammar synthesis adds one. + +The grammar composition logic is in `src/utils/guidance_grammar.rs`. The `GrammarRequestDispatcher` and `GrammarComposer` handle the composition of constraint grammars with reasoning grammars. ### Claude server +Claude reuses the same tool-grammar builder path. + Current state: - Claude does not expose the same client-supplied grammar request surface as the OpenAI endpoint - Claude reasoning is still driven by explicit thinking behavior, not by `reasoning_effort` grammar composition -- Claude requests therefore do not currently enable guided decoding ## Reasoning Effort @@ -99,7 +112,7 @@ Reasoning effort is separate from ordinary structured outputs. 
### Current state -The OpenAI path maps `reasoning_effort` into grammar composition when a request constraint exists. +The OpenAI path maps `reasoning_effort` into grammar composition. Accepted values come from `ReasoningEffort::from_str`: - `none` @@ -115,7 +128,7 @@ Non-Python builds also support: - `custom: