Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
05c7b10
Implement Constrained Generation via LLGuidance
Mar 2, 2026
9a1ab79
Support Qwen3.5 Dense models on Metal (#258)
guoqingbao Mar 8, 2026
72e914f
Utilize SpecialTokens Idiomatic Accessor for EOS
Mar 7, 2026
64d9397
Idiomatic SpecialTokens Access Pattern
Mar 9, 2026
2956d5c
More SpecialTokens, Improve Example/Binary
Mar 9, 2026
0eda75b
SpecialTokens Strings for Llama4 and Qwen3.5 MoE
Mar 9, 2026
dbb961d
ToolConfig Population w/ SpecialTokens
Mar 9, 2026
621b3a6
Drop ToolFormat
Mar 9, 2026
6c0a353
Lead The Horse to Water, Make Him <Think>
Mar 10, 2026
f71d7fe
Tier Reasoning Effort
Mar 10, 2026
558b71f
Anchor XML Tool-Grammar With SpecialTokens Pads
Mar 10, 2026
b3f72f2
Merge remote-tracking branch 'origin/main' into reasoning/pr
guoqingbao Mar 12, 2026
fd53bc6
Cargo fmt
guoqingbao Mar 12, 2026
786fd24
Refactor guided decoding
guoqingbao Mar 12, 2026
0612ddf
Update docs
guoqingbao Mar 12, 2026
6c80f26
Typo fix
guoqingbao Mar 12, 2026
2b02393
Remove tool grammar & fix slow first token response for sync request
guoqingbao Mar 12, 2026
7f70675
Strip guided-decoding’s leftover tool grammar surface
guoqingbao Mar 14, 2026
0157c41
Fix incorrect guidance application
guoqingbao Mar 14, 2026
89341b3
Revert changes for scheduler.rs (tool call related)
guoqingbao Mar 14, 2026
ba46d78
Apply per-sequence guided decoding
guoqingbao Mar 14, 2026
4997457
Remove redundancy
guoqingbao Mar 14, 2026
60e88e0
Fix corner case
guoqingbao Mar 14, 2026
6d8daae
Permit empty tool call result
guoqingbao Mar 14, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "vllm-rs"
version = "0.9.9"
version = "0.9.10"
edition = "2021"
default-run = "vllm-rs"
description = "A minimal, high-performance large language model (LLM) inference engine implementing vLLM in Rust."
Expand All @@ -21,7 +21,8 @@ itertools = "0.13.0"
akin = "0.4.0"
indicatif = "0.17.11"
serde_json = "1.0.108"
llguidance = "0.6"
llguidance = { version = "1.6", default-features = false, features = ["lark"] }
toktrie_hf_tokenizers = "1.6"
toktrie = "1.4"
half = { version = "2.5.0", features = ["num-traits", "use-intrinsics", "rand_distr"] }
tokio = { version = "1.38.0", features = ["sync"] }
Expand Down
13 changes: 12 additions & 1 deletion ReadMe-CN.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@
- [Docker构建](docs/docker.md)
- [工具调用解析](docs/tool_parsing.md)
- [MCP集成与工具调用](docs/mcp_tool_calling.md)
- [结构化输出文档](docs/llguidance-integration.md)
- [Claude Code使用vLLM.rs后端](docs/claude_code.md)
- [OpenCode使用vLLM.rs后端](docs/open_code.md)
- [Goose AI Agent使用vLLM.rs后端](docs/goose.md)
Expand Down Expand Up @@ -307,8 +308,18 @@ cargo install --features metal
</details>

---
## 🔌 MCP集成 (工具调用)

## 🔌 LLGuidance 支持(结构化输出与约束)

vLLM.rs 现在支持通过 llguidance 库实现结构化输出和约束生成:

- **自定义约束**:允许客户端通过 structured_outputs 或 response_format 提交 Lark/Regex/JSON Schema 约束

查看 [**结构化输出文档 →**](docs/llguidance-integration.md)

---

## 🔌 MCP集成 (工具调用)
通过Model Context Protocol让LLM调用外部工具。

```bash
Expand Down
14 changes: 12 additions & 2 deletions ReadMe.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,7 @@ All models support hardware FP8 KV-cache acceleration (requires SM90+ and disabl
- [Docker Build](docs/docker.md)
- [Tool Parsing](docs/tool_parsing.md)
- [MCP Integration and Tool Calling](docs/mcp_tool_calling.md)
- [Structured Outputs](docs/llguidance-integration.md)
- [Work with Claude Code](docs/claude_code.md)
- [Work with OpenCode](docs/opencode.md)
- [Embedding](docs/embeddings.md)
Expand Down Expand Up @@ -275,7 +276,7 @@ Use `--i` to enable interactive mode 🤖, `--ui-server` or `--server` to enable
# Metal/MacOS
vllm-rs --m Qwen/Qwen3-4B-GGUF --f Qwen3-4B-Q4_K_M.gguf --ui-server --prefix-cache
```

<details open>
<summary>Multi-GPU + Unquantized Model</summary>

Expand Down Expand Up @@ -323,6 +324,15 @@ vllm-rs --m Qwen/Qwen3.5-4B-FP8 --ui-server --prefix-cache

---

## 🔌 Guided decoding (Structured Outputs & Constraints)
vLLM.rs now supports structured output and constraint-based generation via llguidance:

- **Custom Constraints**: allow clients to submit Lark/Regex/JSON Schema constraints via OpenAI-compatible structured_outputs/response_format

See [**Structured Outputs Documentation →**](docs/llguidance-integration.md)

---

## 🔌 MCP Integration (Tool Calling)

Enable LLMs to call external tools via Model Context Protocol.
Expand Down Expand Up @@ -425,7 +435,7 @@ PD Disaggregation separates prefill (prompt processing) and decode (token genera

## 📽️ Demo Video

Watch it in action 🎉
Watch it in action 🎉

<video src="https://github.com/user-attachments/assets/7fc6aa0b-78ac-4323-923f-d761dd12857f" width="1000px"></video>

Expand Down
16 changes: 7 additions & 9 deletions docs/goose.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,35 +17,33 @@ python3 -m vllm_rs.server --m Qwen/Qwen3-30B-A3B-Instruct-2507 --d 0,1 --server

## 2) Configure Goose

### Download and install Goose: https://block.github.io/goose/docs/getting-started/installation/

```shell
# For non-UI system,
export GOOSE_DISABLE_KEYRING=1
```

Export empty API KEY

```shell
export VLLM_API_KEY="empty"
```

### Download and install Goose: https://block.github.io/goose/docs/getting-started/installation/

### Configure goose with `Custom Providers` and API key `empty`

```shell
goose configure

┌ goose-configure
┌ goose-configure
◇ What would you like to configure?
│ Custom Providers
│ Custom Providers
◇ What would you like to do?
│ Add A Custom Provider
│ Add A Custom Provider
◇ What type of API is this?
│ OpenAI Compatible
│ OpenAI Compatible
◇ What should we call this provider?
│ vllm-rs
Expand All @@ -60,10 +58,10 @@ goose configure
│ default
◇ Does this provider support streaming responses?
│ Yes
│ Yes
◇ Does this provider require custom headers?
│ No
│ No
└ Custom provider added: vllm-rs
└ Configuration saved successfully to /root/.config/goose/config.yaml
Expand Down
Loading