4 changes: 2 additions & 2 deletions Cargo.toml
@@ -22,8 +22,8 @@ itertools = "0.13.0"
akin = "0.4.0"
indicatif = "0.17.11"
serde_json = "1.0.108"
-llguidance = { version = "1.6", default-features = false, features = ["lark"] }
-toktrie_hf_tokenizers = "1.6"
+llguidance = { version = "1.7", default-features = false, features = ["lark", "referencing", "jsonschema_validation"] }
+toktrie_hf_tokenizers = "1.7"
toktrie = "1.4"
half = { version = "2.5.0", features = ["num-traits", "use-intrinsics", "rand_distr"] }
tokio = { version = "1.38.0", features = ["sync"] }
141 changes: 141 additions & 0 deletions ReadMe-CN.md
@@ -177,6 +177,147 @@ xinfer --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q4_K_M.gguf
## 📘 Usage
> **After installing the Python package**, run it with `python3 -m xinfer.server`

### Installation

<details>
<summary><b>CUDA (Linux)</b></summary>

```bash
# Prerequisites
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

# Optional: CUDA toolkit + NCCL
sudo apt-get install -y cuda-nvcc-12-9 cuda-nvrtc-dev-12-9 libcublas-dev-12-9 libcurand-dev-12-9
sudo apt-get install -y libnccl2 libnccl-dev

# Build and install
cargo install --features cuda,nccl,flashinfer,cutlass
# Flash Attention backend:
cargo install --features cuda,nccl,flashattn,cutlass
# V100 / older hardware (no flash backend):
cargo install --features cuda,nccl
```

</details>

<details>
<summary><b>Metal (macOS)</b></summary>

```bash
# Install the Xcode command line tools first
cargo install --features metal
```

</details>

By default the server starts in **API server mode** (port 8000). Use `--i` for interactive mode 🤖 or `--ui-server` for server mode with a Web UI 🌐. Select a model with `--m` (Hugging Face model ID), `--w` (local Safetensors model path), or `--f` (GGUF model file):

> Single-GPU / multi-GPU inference
<details open>
<summary>Single-GPU inference</summary>

```bash
# CUDA (replace `--i` with `--ui-server` to use the web UI)
vllm-rs --i --m unsloth/Qwen3.5-27B-GGUF --f Qwen3.5-27B-Q4_K_M.gguf --kv-fraction 0.8
# Metal/macOS (generation may be slow on releases before macOS Tahoe; use a smaller `--max-model-len` or `--kv-fraction` to reduce memory usage)
vllm-rs --i --m unsloth/Qwen3.5-4B-GGUF --f Qwen3.5-4B-Q3_K_M.gguf
```
</details>

<details open>
<summary>Multi-GPU, unquantized model</summary>

```bash
vllm-rs --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --ui-server --prefix-cache
```
</details>

<details open>
<summary>FP8/FP4 models</summary>

_FP8 format:_
```bash
vllm-rs --d 0,1 --w /path/Qwen3-Coder-30B-A3B-Instruct-FP8/ --ui-server --prefix-cache
# Or Qwen3-Next 80B
vllm-rs --m Qwen/Qwen3-Coder-Next-FP8 --ui-server --d 0,1 --prefix-cache
```

_MXFP4 format:_
```bash
vllm-rs --m olka-fi/Qwen3.5-4B-MXFP4 --ui-server --prefix-cache
```

_NVFP4 format:_
```bash
vllm-rs --m AxionML/Qwen3.5-9B-NVFP4 --ui-server --prefix-cache
```
</details>

<details open>
<summary>Multi-GPU, quantized model</summary>

```bash
vllm-rs --ui-server --d 0,1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --prefix-cache
```
</details>

<details open>
<summary>Run an unquantized model as a Q4K-quantized model with an FP8 KV cache</summary>

```bash
# Build without `flashinfer` or `flashattn` to use the FP8 KV cache
vllm-rs --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --server --port 8000 --fp8-kvcache
```
</details>

---

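## 🔌 Structured Outputs & Constraints (Guided Decoding)

vLLM.rs now supports structured output and constraint-based generation via the llguidance library.

### ⚠️ Security Notice

**Client-provided constraints are blocked by default.** To enable them, you must explicitly set the `--allow-constraint-api` flag.

#### Enabling Client Constraints
```bash
# Enable client-submitted constraints via HTTP API
vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --allow-constraint-api
```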

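#### Security Risks of Client Constraints
Client-provided constraints can introduce serious security vulnerabilities:

1. **Lark Grammar Injection**: Malicious clients can submit crafted Lark grammars that:
- Access special tokens beyond the user role boundary
- Inject arbitrary regex patterns that could cause ReDoS attacks
- Bypass the chat template's role separation

2. **JSON Schema Escapes**: Clients can specify schemas that:
- Reference internal special tokens not intended for user control
- Create ambiguous token boundaries that leak system instructions
- Inject forbidden regex patterns matching system roles

3. **Role Boundary Violations**: When constraints are enabled, clients can potentially:
- Escape the `user:` role boundary in chat templates
- Inject `system:` or `assistant:` role content
- Manipulate tool_call markers to inject fake tool responses
- Invent new ways to make poorly designed systems behave beyond intended scope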

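#### Recommended Usage
- **Production**: Set `--enable-tool-grammar` and/or `--allow-constraint-api` only when the accessing systems/clients are trusted, or when inbound content is filtered through a tokenizer-aware WAF that validates grammars inline.

```bash
# Enable automatic tool grammar generation
vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --enable-tool-grammar
```

See [**Structured Outputs Documentation →**](docs/llguidance-integration.md)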

---

> To build inside Docker, see [**Running xInfer in Docker →**](docs/docker.md)

### Running models
42 changes: 42 additions & 0 deletions ReadMe.md
@@ -290,6 +290,48 @@ xinfer --m mistralai/Ministral-3-3B --ui-server

---

## 🔌 Guided decoding (Structured Outputs & Constraints)
vLLM.rs now supports structured output and constraint-based generation via llguidance.

### ⚠️ Security Notice

**Client-provided constraints are BLOCKED by default.** To enable them, you must explicitly set the `--allow-constraint-api` flag.

#### Enabling Client Constraints
```bash
# Enable client-submitted constraints via HTTP API
vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --allow-constraint-api
```
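The default-deny behaviour described above can be pictured as a small gate in front of grammar construction. The following is a minimal sketch, not the project's code: `Admission` and `admit_client_constraint` are hypothetical names, and the text does not say whether the real server rejects a blocked request or silently drops the constraint.

```rust
/// Outcome of checking a client-supplied constraint against server policy.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum Admission<'a> {
    /// The request carried no constraint at all.
    Unconstrained,
    /// The constraint may be passed on to grammar construction.
    Allowed(&'a str),
    /// A constraint was supplied, but the server has not opted in.
    Blocked,
}

/// `allow_constraint_api` stands in for the `--allow-constraint-api` flag.
pub fn admit_client_constraint(
    allow_constraint_api: bool,
    constraint: Option<&str>,
) -> Admission<'_> {
    match constraint {
        None => Admission::Unconstrained,
        Some(c) if allow_constraint_api => Admission::Allowed(c),
        Some(_) => Admission::Blocked,
    }
}
```

Keeping the decision in one function means every endpoint that accepts constraints passes through the same check, so a new request field cannot bypass the flag by accident.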

#### Security Risks of Client Constraints
Client-provided constraints can introduce serious security vulnerabilities:

1. **Lark Grammar Injection**: Malicious clients can submit crafted Lark grammars that:
- Access special tokens beyond the user role boundary
- Inject arbitrary regex patterns that could cause ReDoS attacks
- Bypass the chat template's role separation

2. **JSON Schema Escapes**: Clients can specify schemas that:
- Reference internal special tokens not intended for user control
- Create ambiguous token boundaries that leak system instructions
- Inject forbidden regex patterns matching system roles

3. **Role Boundary Violations**: When constraints are enabled, clients can potentially:
- Escape the `user:` role boundary in chat templates
- Inject `system:` or `assistant:` role content
- Manipulate tool_call markers to inject fake tool responses
- Invent new ways to make poorly designed systems behave beyond intended scope

#### Recommended Usage
- **Production**: Set `--enable-tool-grammar` and/or `--allow-constraint-api` only when the accessing systems/clients are trusted, or when inbound content is filtered through a tokenizer-aware WAF that validates grammars inline.

```bash
# Enable automatic tool grammar generation
vllm-rs --m Qwen/Qwen3.5-27B-FP8 --ui-server --prefix-cache --enable-tool-grammar
```

See [**Structured Outputs Documentation →**](docs/llguidance-integration.md)

## 📘 Build from source code

**Option 1 — Cargo**
30 changes: 21 additions & 9 deletions docs/guided_decoding.md
@@ -8,13 +8,23 @@ It focuses on:
- how reasoning effort is applied
- practical usage and validation commands

## JSON Schema Reference

For detailed JSON Schema constraint documentation with curl examples, see [`llguidance-json-schema.md`](llguidance-json-schema.md).

This covers:
- Schema type definitions (string, integer, number, boolean, object, array)
- All supported API endpoints (OpenAI-compatible and Claude server)
- Complete curl examples for each permutation
- Schema sanitization behavior

## Current Model

Guided decoding is request-scoped.

The core engine does not invent grammars on its own. A request either:
- supplies a constraint grammar
- gets that constraint grammar composed with a reasoning prefix
- gets a composed grammar containing both
- or runs unconstrained when neither exists

The final grammar is stored in `SamplingParams.grammar` and consumed by the runner.
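As a rough illustration of the request-scoped model, the sketch below resolves an optional grammar from two optional sources. It assumes the two sources are a client constraint and a synthesized tool grammar, which is one reading of the text. `Grammar` and `resolve_grammar` are hypothetical stand-ins, with plain strings in place of the real `TopLevelGrammar`.

```rust
/// Stand-in for the real grammar type; strings replace actual grammar objects.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Grammar {
    /// Only a client-supplied constraint was present.
    Constraint(String),
    /// Only a tool grammar was present.
    Tool(String),
    /// Both were present and are carried together.
    Composed { constraint: String, tool: String },
}

/// Returns `None` when neither source exists, so the request runs unconstrained.
pub fn resolve_grammar(constraint: Option<String>, tool: Option<String>) -> Option<Grammar> {
    match (constraint, tool) {
        (Some(c), Some(t)) => Some(Grammar::Composed { constraint: c, tool: t }),
        (Some(c), None) => Some(Grammar::Constraint(c)),
        (None, Some(t)) => Some(Grammar::Tool(t)),
        (None, None) => None,
    }
}
```

The caller would store the returned value in the request's sampling parameters. The engine never sees a grammar that the request path did not produce.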
@@ -44,7 +54,7 @@ The server composes:

The result is a single `TopLevelGrammar` assigned to `SamplingParams.grammar`.

-If no client-supplied constraint grammar exists, `params.grammar` stays `None`.
+If no constraint grammar and no tool grammar exist, `params.grammar` stays `None`.

### 3. Sampling in runner

@@ -84,22 +94,25 @@ Legacy fields
- `constraint`
- `constraint_type = regex | lark | json_schema | json`

-If a request provides none of the above, guided decoding is not enabled.
+If a request provides none of the above, guided decoding is not enabled unless tool grammar synthesis adds one.

The grammar composition logic is in `src/utils/guidance_grammar.rs`. The `GrammarRequestDispatcher` and `GrammarComposer` handle the composition of constraint grammars with reasoning grammars.
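The legacy `constraint_type` values listed above (`regex`, `lark`, `json_schema`, `json`) map naturally onto an enum. This is a hedged sketch rather than the code in `src/utils/guidance_grammar.rs`: `ConstraintType` is a hypothetical name, and exact lowercase matching is an assumption.

```rust
use std::str::FromStr;

/// The four legacy `constraint_type` values named in the documentation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConstraintType {
    Regex,
    Lark,
    JsonSchema,
    Json,
}

impl FromStr for ConstraintType {
    type Err = String;

    /// Accepts only the exact lowercase spellings (an assumption of this sketch).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "regex" => Ok(Self::Regex),
            "lark" => Ok(Self::Lark),
            "json_schema" => Ok(Self::JsonSchema),
            "json" => Ok(Self::Json),
            other => Err(format!("unsupported constraint_type: {other}")),
        }
    }
}
```

Parsing into a closed enum before any grammar is built lets unknown values fail early with a clear error instead of reaching the composer.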

### Claude server

Claude reuses the same tool-grammar builder path.

Current state:
- Claude does not expose the same client-supplied grammar request surface as the OpenAI endpoint
- Claude reasoning is still driven by explicit thinking behavior, not by `reasoning_effort` grammar composition
-- Claude requests therefore do not currently enable guided decoding

## Reasoning Effort

Reasoning effort is separate from ordinary structured outputs.

### Current state

-The OpenAI path maps `reasoning_effort` into grammar composition when a request constraint exists.
+The OpenAI path maps `reasoning_effort` into grammar composition.

Accepted values come from `ReasoningEffort::from_str`:
- `none`
@@ -115,7 +128,7 @@ Non-Python builds also support:
- `custom:<template>`

Relevant code:
-- `src/utils/reasoning.rs`
+- `src/utils/guidance_grammar.rs`
- `src/server/server.rs`
- `src/utils/guidance.rs`

@@ -124,7 +137,6 @@ Relevant code:
Reasoning effort:
- does not enable chat-template thinking by itself
- only affects grammar composition
-- is ignored when no request constraint grammar is present
- only works when reasoning start/end tokens are available

If the tokenizer does not expose reasoning markers, the system logs a warning and falls back to the base grammar.
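The fallback rule can be sketched as follows. This is illustrative only: `ReasoningMarkers`, `ComposedGrammar`, and `compose_with_reasoning` are hypothetical names, the base grammar is a plain string, and the warning goes to stderr rather than through the project's logger.

```rust
/// Reasoning start/end tokens, if the tokenizer exposes them.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReasoningMarkers {
    pub start: String,
    pub end: String,
}

/// A base grammar, optionally preceded by a reasoning section.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComposedGrammar {
    pub reasoning: Option<ReasoningMarkers>,
    pub base: String,
}

/// Adds the reasoning section only when it was requested and markers exist.
/// Otherwise the base grammar is returned unchanged.
pub fn compose_with_reasoning(
    base: String,
    effort_requested: bool,
    markers: Option<ReasoningMarkers>,
) -> ComposedGrammar {
    let reasoning = match (effort_requested, markers) {
        (true, Some(m)) => Some(m),
        (true, None) => {
            eprintln!("warning: tokenizer exposes no reasoning markers; using base grammar");
            None
        }
        (false, _) => None,
    };
    ComposedGrammar { reasoning, base }
}
```

Falling back instead of failing keeps a request usable on tokenizers without reasoning tokens, at the cost of silently losing the reasoning prefix. That is why the warning matters.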
@@ -160,7 +172,7 @@ xinfer --m Qwen/Qwen3.5-35B-A3B-FP8/ --ui-server --d 0
| Enforce text pattern | `structured_outputs` or `constraint` | `regex` |
| Enforce full object schema | `structured_outputs` or `response_format` | `json` / `json_schema` |
| Enforce custom grammar | `structured_outputs` or `constraint` | `grammar` / `lark` |
-| Constrain tagged structured output payload | `structured_outputs` | `structural_tag` |
+| Constrain tool call payload | `structured_outputs` or automatic tool grammar | `structural_tag` / tool grammar |
| Add a reasoning prefix | `reasoning_effort` | `low`, `medium`, `high`, etc. |

### 1. Constrain the answer to a fixed set
@@ -470,5 +482,5 @@ Check:

- Guided decoding is only active when `SamplingParams.grammar` is present.
- OpenAI currently has the richest client-facing grammar surface.
-- Claude does not currently expose request-level guided decoding.
+- Claude currently reuses tool grammar, but not the same direct constraint request API.
- No request-level grammar means no guided decoding.
2 changes: 2 additions & 0 deletions src/api.rs
@@ -167,6 +167,8 @@ impl EngineBuilder {
false,
false,
None,
false,
false,
);

if let Some(kv_dtype) = self.kvcache_dtype {