Merged

29 commits
6d6f085
feat: Add Anthropic Messages API endpoint with tool call parser fallback
janhilgard Feb 6, 2026
a158cef
Fix VRAM spike at end of generation via lazy-eval-safe cache handling
janhilgard Feb 6, 2026
af46e0d
Add mid-prefill cache saving, disconnect detection, and cancellation …
janhilgard Feb 6, 2026
3b6f756
Fix lint: rename ambiguous variable `l`, remove unused import
janhilgard Feb 6, 2026
dd00521
Fix mid-prefill cache restore: convert BatchKVCache to KVCache
janhilgard Feb 6, 2026
44b1cea
Fix cache fetch: prefer supersequence match over prefix match
janhilgard Feb 6, 2026
ae687bb
Add prefix cache sharing for requests with common prefix
janhilgard Feb 6, 2026
49ef8e6
Fix Metal SIGABRT crash when aborting chunked prefill mid-stream
janhilgard Feb 6, 2026
21e282e
Abort chunked prefill immediately on client disconnect
janhilgard Feb 7, 2026
aa97fe4
Fix streaming tool call parsing: emit structured delta.tool_calls
janhilgard Feb 7, 2026
3452b44
Fix nested JSON serialization in Nemotron XML tool parser
janhilgard Feb 7, 2026
0ce82e9
Fix prompt cache for hybrid Mamba+Transformer models
janhilgard Feb 7, 2026
06b7af4
Add prefix-subset eviction to reduce cache memory ~6x
janhilgard Feb 7, 2026
e9791ef
Fix multi-turn tool calling: enable native tool format for Hermes parser
janhilgard Feb 7, 2026
c714334
Fix <|im_end|> leaking into streaming OpenAI responses
janhilgard Feb 7, 2026
e7cf750
Fix stop token leaking as text: skip decoding EOS tokens in streaming
janhilgard Feb 7, 2026
65c5830
Fix prompt cache for chunked prefill (enables Anthropic endpoint cach…
janhilgard Feb 7, 2026
43ce701
Stop clearing prefix cache on sampler param changes
janhilgard Feb 7, 2026
b921a5e
Fix crash loop on cache shape mismatch after BatchGenerator recreation
janhilgard Feb 7, 2026
0ec2faf
Fix prefix cache for agentic multi-turn on hybrid Mamba+Transformer m…
janhilgard Feb 7, 2026
3f8b006
Optimize inference: bisect cache lookup, remove deepcopy, reduce clea…
janhilgard Feb 8, 2026
52977e9
Add GET /v1/status endpoint with real-time per-request monitoring
janhilgard Feb 8, 2026
ccba569
Merge origin/main into feature/anthropic-endpoint
janhilgard Feb 8, 2026
25557e2
fix: update native tool format test for HermesToolParser
janhilgard Feb 8, 2026
34e2e93
fix: add missing --default-temperature and --default-top-p CLI args
janhilgard Feb 8, 2026
a7ecc45
Pass request context to tool parsers for tool name validation
waybarrios Feb 8, 2026
2902742
Add Anthropic endpoint docs, tests, and CI integration
waybarrios Feb 8, 2026
2f06a7a
Fix black formatting in test_anthropic_adapter
waybarrios Feb 8, 2026
50ca1f0
Expand Anthropic endpoint and status docs with full examples
waybarrios Feb 8, 2026
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -80,6 +80,8 @@ jobs:
tests/test_api_models.py \
tests/test_api_utils.py \
tests/test_request.py \
tests/test_anthropic_models.py \
tests/test_anthropic_adapter.py \
-v --tb=short \
-k "not Integration and not InjectJson and not TestMLXMultimodalLMCache" \
--cov=vllm_mlx \
384 changes: 384 additions & 0 deletions docs/guides/server.md
@@ -128,6 +128,390 @@ GET /health

Returns server status.

### Anthropic Messages API

```bash
POST /v1/messages
```

Anthropic-compatible endpoint that allows tools like Claude Code and OpenCode to connect directly to vllm-mlx. Internally it translates Anthropic requests to OpenAI format, runs inference through the engine, and converts the response back to Anthropic format.

Capabilities:
- Non-streaming and streaming responses (SSE)
- System messages (plain string or list of content blocks)
- Multi-turn conversations with user and assistant messages
- Tool calling with `tool_use` / `tool_result` content blocks
- Token counting for budget tracking
- Multimodal content (images via `source` blocks)
- Client disconnect detection (returns HTTP 499)
- Automatic special token filtering in streamed output
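
The request translation can be sketched roughly as follows. This is a minimal illustration, not the actual vllm-mlx adapter: `anthropic_to_openai` is a hypothetical helper that covers only a string `system` prompt and plain-text messages, not `tool_use`/`tool_result` blocks or images.

```python
def anthropic_to_openai(req: dict) -> dict:
    """Map a minimal Anthropic /v1/messages payload to OpenAI chat format.

    Illustrative sketch only: handles a string `system` and text-only
    messages; the real adapter also translates tool and image blocks.
    """
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if isinstance(req.get("system"), str):
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        if isinstance(content, list):  # flatten a list of text blocks
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req["model"],
        "messages": messages,
        "max_tokens": req["max_tokens"],
        "stream": req.get("stream", False),
    }
```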

#### Non-streaming

```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
model="default",
max_tokens=256,
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)
# Response includes: response.id, response.model, response.stop_reason,
# response.usage.input_tokens, response.usage.output_tokens
```

#### Streaming

Streaming follows the Anthropic SSE event protocol. Events are emitted in this order:
`message_start` -> `content_block_start` -> `content_block_delta` (repeated) -> `content_block_stop` -> `message_delta` -> `message_stop`

```python
with client.messages.stream(
model="default",
max_tokens=256,
messages=[{"role": "user", "content": "Tell me a story"}]
) as stream:
for text in stream.text_stream:
print(text, end="")
```
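
Under the hood, the streamed text arrives as `content_block_delta` events carrying `text_delta` payloads. As a sketch of what the SDK does for you, a hypothetical helper that folds an already-parsed event sequence back into the full message text:

```python
def assemble_text(events: list[dict]) -> str:
    """Concatenate text_delta payloads from an Anthropic SSE event sequence."""
    parts = []
    for ev in events:
        if ev.get("type") == "content_block_delta":
            delta = ev.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta["text"])
    return "".join(parts)

# Events in the order the protocol emits them:
events = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Once upon"}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": " a time"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}},
    {"type": "message_stop"},
]
print(assemble_text(events))  # prints: Once upon a time
```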

#### System messages

System messages can be a plain string or a list of content blocks:

```python
# Plain string
response = client.messages.create(
model="default",
max_tokens=256,
system="You are a helpful coding assistant.",
messages=[{"role": "user", "content": "Write a hello world in Python"}]
)

# List of content blocks
response = client.messages.create(
model="default",
max_tokens=256,
system=[
{"type": "text", "text": "You are a helpful assistant."},
{"type": "text", "text": "Be concise in your answers."},
],
messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

#### Tool calling

Define tools with `name`, `description`, and `input_schema`. The model returns `tool_use` content blocks when it wants to call a tool. Send results back as `tool_result` blocks.

```python
# Step 1: Send request with tools
response = client.messages.create(
model="default",
max_tokens=1024,
messages=[{"role": "user", "content": "What's the weather in Paris?"}],
tools=[{
"name": "get_weather",
"description": "Get weather for a city",
"input_schema": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}]
)

# Step 2: Check if model wants to use tools
for block in response.content:
if block.type == "tool_use":
print(f"Tool: {block.name}, Input: {block.input}, ID: {block.id}")
# response.stop_reason will be "tool_use"

# Step 3: Send tool result back
response = client.messages.create(
model="default",
max_tokens=1024,
messages=[
{"role": "user", "content": "What's the weather in Paris?"},
{"role": "assistant", "content": response.content},
{"role": "user", "content": [
{
"type": "tool_result",
"tool_use_id": block.id,
"content": "Sunny, 22C"
}
]}
],
tools=[{
"name": "get_weather",
"description": "Get weather for a city",
"input_schema": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}]
)
print(response.content[0].text) # "The weather in Paris is sunny, 22C."
```

Tool choice modes:

| `tool_choice` | Behavior |
|---------------|----------|
| `{"type": "auto"}` | Model decides whether to call tools (default) |
| `{"type": "any"}` | Model must call at least one tool |
| `{"type": "tool", "name": "get_weather"}` | Model must call the specified tool |
| `{"type": "none"}` | Model will not call any tools |

#### Multi-turn conversations

```python
messages = [
{"role": "user", "content": "My name is Alice."},
{"role": "assistant", "content": "Nice to meet you, Alice!"},
{"role": "user", "content": "What's my name?"},
]

response = client.messages.create(
model="default",
max_tokens=100,
messages=messages
)
```

#### Token counting

```bash
POST /v1/messages/count_tokens
```

Counts input tokens for an Anthropic request using the model's tokenizer, which is useful for budget tracking before sending a request. Tokens are counted from system messages, conversation messages, `tool_use` inputs, `tool_result` content, and tool definitions (`name`, `description`, `input_schema`).

```python
import requests

resp = requests.post("http://localhost:8000/v1/messages/count_tokens", json={
"model": "default",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"system": "You are helpful.",
"tools": [{
"name": "search",
"description": "Search the web",
"input_schema": {"type": "object", "properties": {"q": {"type": "string"}}}
}]
})
print(resp.json()) # {"input_tokens": 42}
```

#### curl examples

Non-streaming:

```bash
curl http://localhost:8000/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Hello!"}]
}'
```

Streaming:

```bash
curl http://localhost:8000/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"max_tokens": 256,
"stream": true,
"messages": [{"role": "user", "content": "Tell me a joke"}]
}'
```

Token counting:

```bash
curl http://localhost:8000/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# {"input_tokens": 12}
```

#### Request fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `model` | string | yes | - | Model name (use `"default"` for the loaded model) |
| `messages` | list | yes | - | Conversation messages with `role` and `content` |
| `max_tokens` | int | yes | - | Maximum number of tokens to generate |
| `system` | string or list | no | null | System prompt (string or list of `{"type": "text", "text": "..."}` blocks) |
| `stream` | bool | no | false | Enable SSE streaming |
| `temperature` | float | no | 0.7 | Sampling temperature (0.0 = greedy/deterministic; higher values increase randomness) |
| `top_p` | float | no | 0.9 | Nucleus sampling threshold |
| `top_k` | int | no | null | Top-k sampling |
| `stop_sequences` | list | no | null | Sequences that stop generation |
| `tools` | list | no | null | Tool definitions with `name`, `description`, `input_schema` |
| `tool_choice` | dict | no | null | Tool selection mode (`auto`, `any`, `tool`, `none`) |
| `metadata` | dict | no | null | Arbitrary metadata (passed through, not used by server) |

#### Response format

Non-streaming response:

```json
{
"id": "msg_abc123...",
"type": "message",
"role": "assistant",
"model": "default",
"content": [
{"type": "text", "text": "Hello! How can I help?"}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {
"input_tokens": 12,
"output_tokens": 8
}
}
```

When tools are called, `content` includes `tool_use` blocks and `stop_reason` is `"tool_use"`:

```json
{
"content": [
{"type": "text", "text": "Let me check the weather."},
{
"type": "tool_use",
"id": "call_abc123",
"name": "get_weather",
"input": {"city": "Paris"}
}
],
"stop_reason": "tool_use"
}
```

Stop reasons:

| `stop_reason` | Meaning |
|---------------|---------|
| `end_turn` | Model finished naturally |
| `tool_use` | Model wants to call a tool |
| `max_tokens` | Hit the `max_tokens` limit |
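
A client loop typically branches on this field. A minimal dispatch sketch (the action strings are illustrative, not part of any API):

```python
def next_action(stop_reason: str) -> str:
    """Decide the follow-up step for a completed message (illustrative)."""
    if stop_reason == "tool_use":
        # Execute the requested tools, then send tool_result blocks back.
        return "run-tools-and-continue"
    if stop_reason == "max_tokens":
        # Response was truncated; consider a follow-up turn or a higher limit.
        return "maybe-continue"
    # end_turn: the model finished naturally.
    return "done"
```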

#### Using with Claude Code

Point Claude Code directly at your vllm-mlx server:

```bash
# Start the server
vllm-mlx serve mlx-community/Qwen3-Coder-Next-235B-A22B-4bit \
--continuous-batching \
--enable-auto-tool-choice \
--tool-call-parser hermes

# In another terminal, configure Claude Code
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
```

### Server Status

```bash
GET /v1/status
```

Real-time monitoring endpoint that returns server-wide statistics and per-request details. Useful for debugging performance, tracking cache efficiency, and monitoring Metal GPU memory.

```bash
curl -s http://localhost:8000/v1/status | python -m json.tool
```

Example response:

```json
{
"status": "running",
"model": "mlx-community/Qwen3-8B-4bit",
"uptime_s": 342.5,
"steps_executed": 1247,
"num_running": 1,
"num_waiting": 0,
"total_requests_processed": 15,
"total_prompt_tokens": 28450,
"total_completion_tokens": 3200,
"metal": {
"active_memory_gb": 5.2,
"peak_memory_gb": 8.1,
"cache_memory_gb": 2.3
},
"cache": {
"type": "memory_aware_cache",
"entries": 5,
"hit_rate": 0.87,
"memory_mb": 2350
},
"requests": [
{
"request_id": "req_abc123",
"phase": "generation",
"tokens_per_second": 45.2,
"ttft_s": 0.8,
"progress": 0.35,
"cache_hit_type": "prefix",
"cached_tokens": 1200,
"generated_tokens": 85,
"max_tokens": 256
}
]
}
```

Response fields:

| Field | Description |
|-------|-------------|
| `status` | Server state: `running`, `stopped`, or `not_loaded` |
| `model` | Name of the loaded model |
| `uptime_s` | Seconds since the server started |
| `steps_executed` | Total inference steps executed |
| `num_running` | Number of requests currently generating tokens |
| `num_waiting` | Number of requests queued for prefill |
| `total_requests_processed` | Total requests completed since startup |
| `total_prompt_tokens` | Total prompt tokens processed since startup |
| `total_completion_tokens` | Total completion tokens generated since startup |
| `metal.active_memory_gb` | Current Metal GPU memory in use (GB) |
| `metal.peak_memory_gb` | Peak Metal GPU memory usage (GB) |
| `metal.cache_memory_gb` | Metal cache memory usage (GB) |
| `cache` | Cache statistics (type, entries, hit rate, memory usage) |
| `requests` | List of active requests with per-request details |

Per-request fields in `requests`:

| Field | Description |
|-------|-------------|
| `request_id` | Unique request identifier |
| `phase` | Current phase: `queued`, `prefill`, or `generation` |
| `tokens_per_second` | Generation throughput for this request |
| `ttft_s` | Time to first token (seconds) |
| `progress` | Completion percentage (0.0 to 1.0) |
| `cache_hit_type` | Cache match type: `exact`, `prefix`, `supersequence`, `lcp`, or `miss` |
| `cached_tokens` | Number of tokens served from cache |
| `generated_tokens` | Tokens generated so far |
| `max_tokens` | Maximum tokens requested |
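
These fields combine naturally into a one-line monitor. A sketch that summarizes a parsed status payload like the example above (`summarize` is a hypothetical helper, not part of the server):

```python
def summarize(status: dict) -> str:
    """Render a compact one-line-per-request summary of /v1/status output."""
    m = status["metal"]
    line = (
        f"{status['model']}: {status['num_running']} running / "
        f"{status['num_waiting']} waiting, "
        f"metal {m['active_memory_gb']:.1f} GB (peak {m['peak_memory_gb']:.1f} GB)"
    )
    for r in status.get("requests", []):
        line += (
            f"\n  {r['request_id']} [{r['phase']}] "
            f"{r['generated_tokens']}/{r['max_tokens']} tokens "
            f"@ {r['tokens_per_second']:.1f} tok/s"
        )
    return line

status = {
    "model": "mlx-community/Qwen3-8B-4bit",
    "num_running": 1,
    "num_waiting": 0,
    "metal": {"active_memory_gb": 5.2, "peak_memory_gb": 8.1},
    "requests": [{
        "request_id": "req_abc123", "phase": "generation",
        "generated_tokens": 85, "max_tokens": 256, "tokens_per_second": 45.2,
    }],
}
print(summarize(status))
```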

## Tool Calling

Enable OpenAI-compatible tool calling with `--enable-auto-tool-choice`: