Merged

Changes from all commits · 24 commits
97abd27
[docs] Convert serve/llm/quick-start.rst to MyST Markdown
kouroshHakha Oct 16, 2025
5b02921
[docs] Phase 1: Create LLM docs directory structure
kouroshHakha Oct 16, 2025
0e0ee93
[docs] Add examples.md for LLM documentation
kouroshHakha Oct 16, 2025
90821cb
[docs] Add troubleshooting.md with FAQs
kouroshHakha Oct 16, 2025
32aa562
[docs] Add multi-lora user guide
kouroshHakha Oct 16, 2025
34a4e2e
[docs] Remove Multi-LoRA section from quick-start
kouroshHakha Oct 16, 2025
00c950a
[docs] Add routing policy documentation
kouroshHakha Oct 16, 2025
4e69405
[docs] Add navigation structure for LLM docs
kouroshHakha Oct 16, 2025
e8aabdb
[docs] Apply Ray docs style and rename routing images
kouroshHakha Oct 16, 2025
5400efc
[docs] Remove old prefix-aware-request-router.md
kouroshHakha Oct 16, 2025
7f831dd
[docs] Add prefill/decode disaggregation user guide
kouroshHakha Oct 16, 2025
869f476
[docs] Add vLLM compatibility user guide
kouroshHakha Oct 16, 2025
31bad71
wip
kouroshHakha Oct 16, 2025
27b3c60
[docs] Add model loading user guide
kouroshHakha Oct 16, 2025
ccfd38e
[docs] Remove extracted content from quick-start
kouroshHakha Oct 16, 2025
7d09a33
wip
kouroshHakha Oct 16, 2025
f408ba7
wip
kouroshHakha Oct 16, 2025
5942202
[docs] Clean up .gitkeep files and remove empty api/ directory
kouroshHakha Oct 16, 2025
0be378e
wip
kouroshHakha Oct 16, 2025
9c21149
Merge branch 'master' into kh/llm-docs-phase-1
kouroshHakha Oct 16, 2025
2a04e40
[docs] Add observability guide and refine documentation overview
kouroshHakha Oct 17, 2025
31f4ea7
Apply suggestions from code review
kouroshHakha Oct 17, 2025
c622ec6
Apply suggestions from code review
kouroshHakha Oct 17, 2025
6c35085
fix comments
kouroshHakha Oct 18, 2025
@@ -1,4 +1,5 @@
# Example: Basic NIXLConnector configuration for prefill/decode disaggregation
# nixl_config.yaml

applications:
- args:
10 changes: 10 additions & 0 deletions doc/source/serve/llm/architecture/index.md
@@ -0,0 +1,10 @@
# Architecture

Technical details and design documentation for Ray Serve LLM.

```{toctree}
:maxdepth: 1

Request routing <routing-policies>
```

239 changes: 239 additions & 0 deletions doc/source/serve/llm/architecture/routing-policies.md
@@ -0,0 +1,239 @@
# Request routing

Ray Serve LLM provides customizable request routing to optimize request distribution across replicas for different workload patterns. Request routing operates at the **replica selection level**, distinct from ingress-level model routing.

## Routing versus ingress

You need to distinguish between two levels of routing:

**Ingress routing** (model-level):
- Maps `model_id` to deployment
- Example: `OpenAiIngress` gets `/v1/chat/completions` with `model="gptoss"` and maps it to the `gptoss` deployment.

**Request routing** (replica-level):
- Chooses which replica to send the request to
- Example: The `gptoss` deployment handle inside the `OpenAiIngress` replica decides which replica of the deployment (1, 2, or 3) to send the request to.

This document focuses on **request routing** (replica selection).

```
HTTP Request → Ingress (model routing) → Request Router (replica selection) → Server Replica
```

## Request routing architecture

Ray Serve LLM request routing operates at the deployment handle level:

```
┌──────────────┐
│   Ingress    │
│ (Replica 1)  │
└──────┬───────┘
       │ handle.remote(request)
┌──────────────────┐
│ Deployment Handle│
│     + Router     │  ← Request routing happens here
└──────┬───────────┘
       │ Chooses replica based on policy
   ┌───┴────┬────────┬────────┐
   │        │        │        │
┌──▼──┐  ┌──▼──┐  ┌──▼──┐  ┌──▼──┐
│ LLM │  │ LLM │  │ LLM │  │ LLM │
│  1  │  │  2  │  │  3  │  │  4  │
└─────┘  └─────┘  └─────┘  └─────┘
```

## Available routing policies

Ray Serve LLM provides multiple request routing policies to optimize for different workload patterns:

### Default routing: Power of Two Choices

The default router uses the Power of Two Choices algorithm to:
1. Randomly sample two replicas
2. Route to the replica with fewer ongoing requests

This provides good load balancing with minimal coordination overhead.
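
The following is a minimal, self-contained sketch of the Power of Two Choices selection step. The replica representation and queue-length attribute are illustrative, not the actual Ray Serve internals:

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaStats:
    """Illustrative stand-in for a replica and its current queue length."""
    replica_id: str
    num_ongoing_requests: int


def power_of_two_choices(replicas: List[ReplicaStats]) -> ReplicaStats:
    """Sample two replicas at random and route to the less loaded one."""
    first, second = random.sample(replicas, 2)
    return min(first, second, key=lambda r: r.num_ongoing_requests)


# Example: with queue lengths [3, 0, 7], the router always avoids the busiest replica.
replicas = [
    ReplicaStats("replica-1", 3),
    ReplicaStats("replica-2", 0),
    ReplicaStats("replica-3", 7),
]
print(power_of_two_choices(replicas).replica_id)
```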

### Prefix-aware routing

The `PrefixCacheAffinityRouter` optimizes for workloads with shared prefixes by routing requests with similar prefixes to the same replicas. This improves KV cache hit rates in vLLM's Automatic Prefix Caching (APC).

The routing strategy:
1. **Check load balance**: If replicas are balanced (queue difference < threshold), use prefix matching
2. **High match rate (≥10%)**: Route to replicas with highest prefix match
3. **Low match rate (<10%)**: Route to replicas with lowest cache utilization
4. **Fallback**: Use Power of Two Choices when load is imbalanced

For more details, see {ref}`prefix-aware-routing-guide`.
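
The following self-contained sketch illustrates that decision flow. The data model, the prefix-matching helper, and the thresholds are placeholders chosen to mirror the description above; they aren't the actual `PrefixCacheAffinityRouter` implementation.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Illustrative replica state; not the actual Ray Serve data model."""
    replica_id: str
    num_ongoing_requests: int = 0
    kv_cache_utilization: float = 0.0
    cached_prefixes: List[str] = field(default_factory=list)


def prefix_match_rate(prompt: str, replica: Replica) -> float:
    """Fraction of the prompt covered by the replica's best cached prefix."""
    best = 0
    for prefix in replica.cached_prefixes:
        common = 0
        for a, b in zip(prompt, prefix):
            if a != b:
                break
            common += 1
        best = max(best, common)
    return best / max(len(prompt), 1)


def choose_replica(
    prompt: str,
    replicas: List[Replica],
    balance_threshold: int = 10,    # queue-length spread at which load counts as imbalanced
    match_threshold: float = 0.10,  # 10% prefix match rate
) -> Replica:
    queue_lens = [r.num_ongoing_requests for r in replicas]
    if max(queue_lens) - min(queue_lens) >= balance_threshold:
        # Load is imbalanced: fall back to Power of Two Choices.
        first, second = random.sample(replicas, 2)
        return min(first, second, key=lambda r: r.num_ongoing_requests)

    rates = {r.replica_id: prefix_match_rate(prompt, r) for r in replicas}
    best_id = max(rates, key=rates.get)
    if rates[best_id] >= match_threshold:
        # High match rate: route to the replica with the highest prefix match.
        return next(r for r in replicas if r.replica_id == best_id)
    # Low match rate: route to the replica with the lowest cache utilization.
    return min(replicas, key=lambda r: r.kv_cache_utilization)
```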

## Design patterns for custom routing policies

Custom request routers are a native Ray Serve feature that you configure per deployment. For each deployment, you can customize the routing logic that runs every time a caller invokes `.remote()` on the deployment handle. Because deployment handles are globally available objects across the cluster, you can call them from any actor or task in the Ray cluster. For more details on this API, see {ref}`custom-request-router-guide`.

This means the same routing logic runs even if you have multiple handles to a deployment. The default request router in Ray Serve is Power of Two Choices, which balances load across replicas and prioritizes locality. However, you can customize it to use LLM-specific metrics.

Ray Serve LLM ships prefix-aware routing as part of the framework. Beyond that, there are two common architectural patterns for custom request routers. Each comes with clear trade-offs, so choose the one that best balances simplicity and performance for your workload:

### Pattern 1: Centralized singleton metric store

In this approach, you keep a centralized metric store (for example, a singleton actor) that tracks routing-related information. The request router logic physically runs on the process that owns the deployment handle, and there can be many such processes. Each one queries the singleton actor, which acts as a multi-tenant store that gives all request routers a consistent view of the cluster state.

The actor can expose atomic, thread-safe operations such as `get()` for querying the global state and `set()` for updating it, which the router can call during `choose_replicas()` and `on_request_routed()`.

```
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Ingress │────►│ Metric  │◄────│ Ingress │
│    1    │     │  Store  │     │    2    │
└────┬────┘     └─────────┘     └────┬────┘
     │                               │
     └───────────────┬───────────────┘
          ┌──────────┴──────────┐
          │                     │
     ┌────▼────┐           ┌────▼────┐
     │   LLM   │           │   LLM   │
     │ Server  │           │ Server  │
     └─────────┘           └─────────┘
```


```{figure} ../images/routing_centralized_store.png
---
width: 600px
name: centralized_metric_store_pattern
---
Centralized metric store pattern for custom routing
```
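
A minimal sketch of the centralized store piece follows. It assumes a detached, named Ray actor plays the role of the singleton; the method names and metric schema are illustrative, and the router-side wiring through `choose_replicas()` is omitted.

```python
import ray


@ray.remote
class MetricStore:
    """Singleton actor that holds routing-related state for all routers."""

    def __init__(self):
        self._state = {}

    def get(self) -> dict:
        # Return a snapshot of the global state.
        return dict(self._state)

    def set(self, key: str, value) -> None:
        # Actor method calls run one at a time, so updates are atomic.
        self._state[key] = value


# Every process that owns a deployment handle looks up the same named actor.
store = MetricStore.options(
    name="routing_metric_store", get_if_exists=True, lifetime="detached"
).remote()

ray.get(store.set.remote("replica-1:num_ongoing_requests", 3))
print(ray.get(store.get.remote()))
```

A router would `await` these calls inside `choose_replicas()` and `on_request_routed()` instead of using blocking `ray.get()`; see the async guidance later on this page.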

**Pros:**

> **Reviewer (Contributor):** Suggested change: `**Pros:**` → `#### Pros`
> **Author:** formatting of this wouldn't look good?

- Simple implementation - no need to modify deployment logic for recording replica statistics
- Request metrics are immediately available
- Strong consistency guarantees

**Cons:**

> **Reviewer (Contributor):** Suggested change: `**Cons:**` → `#### Cons`
> **Author:** same as pros

- A single actor can become a bottleneck in high-throughput applications (on the order of thousands of requests per second), where the extra RPC call adds to time to first token (TTFT)
- Requires an additional network hop for every routing decision

### Pattern 2: Metrics broadcasted from Serve controller

> **Reviewer (Contributor):** Suggested change: `### Pattern 2: Metrics broadcasted from Serve controller` → `### The Serve controller broadcasts metrics`
> **Author:** same as above


In this approach, the Serve controller polls each replica for its local statistics and broadcasts them to all request routers through their deployment handles. Each request router then uses this globally broadcast information to pick a replica. When a request reaches a replica, the replica updates its local statistics so that it can report them back the next time the controller polls it.

```
        ┌──────────────┐
        │    Serve     │
        │  Controller  │
        └──────┬───────┘
               │ (broadcast)
     ┌─────────┴─────────┐
     │                   │
┌────▼────┐         ┌────▼────┐
│ Ingress │         │ Ingress │
│ +Cache  │         │ +Cache  │
└────┬────┘         └────┬────┘
     │                   │
     └─────────┬─────────┘
        ┌──────┴──────┐
        │             │
   ┌────▼────┐   ┌────▼────┐
   │   LLM   │   │   LLM   │
   │ Server  │   │ Server  │
   └─────────┘   └─────────┘
```


```{figure} ../images/routing_broadcast_metrics.png
---
width: 600px
name: broadcast_metrics_pattern
---
Broadcast metrics pattern for custom routing
```
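
The sketch below only illustrates the router-side data flow: a cached view of per-replica statistics that gets overwritten whenever a new broadcast arrives. The payload shape is hypothetical; the real wiring goes through the Serve controller's polling and broadcast machinery.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BroadcastView:
    """Router-side cache of per-replica stats pushed by the controller."""
    stats: Dict[str, dict] = field(default_factory=dict)
    last_updated: float = 0.0

    def update(self, payload: Dict[str, dict]) -> None:
        # Called when a new broadcast arrives; overwrites the local view.
        self.stats = payload
        self.last_updated = time.monotonic()

    def pick_least_loaded(self) -> str:
        # Decisions use the cached view, so they can lag the ground truth.
        return min(self.stats, key=lambda rid: self.stats[rid]["num_ongoing_requests"])


view = BroadcastView()
view.update({
    "replica-1": {"num_ongoing_requests": 4},
    "replica-2": {"num_ongoing_requests": 1},
})
print(view.pick_least_loaded())  # replica-2
```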

**Pros:**

> **Reviewer (Contributor):** Suggested change: `**Pros:**` → `#### Pros`

- Scalable to higher throughput
- No additional RPC overhead per routing decision
- Distributed routing decision making

**Cons:**

> **Reviewer (Contributor):** Suggested change: `**Cons:**` → `#### Cons`

- Time lag between the request router's view of statistics and the ground truth state of the replicas
- Eventual consistency - routers may base decisions on slightly stale data
- More complex implementation requiring coordination with the Serve controller


### Choosing a pattern

- **Use Pattern 1 (centralized store)** when you need strong consistency, have moderate throughput requirements, or want a simpler implementation
- **Use Pattern 2 (broadcast metrics)** when you need very high throughput, can tolerate eventual consistency, or want to minimize per-request overhead

## Custom routing policies

You can implement custom routing policies by extending Ray Serve's [`RequestRouter`](../../api/doc/ray.serve.request_router.RequestRouter.rst) base class. For detailed examples and step-by-step guides on implementing custom routers, see {ref}`custom-request-router-guide`.

Key methods to implement (a minimal skeleton follows this list):
- [`choose_replicas()`](../../api/doc/ray.serve.request_router.RequestRouter.choose_replicas.rst): Select which replicas should handle a request
- [`on_request_routed()`](../../api/doc/ray.serve.request_router.RequestRouter.on_request_routed.rst): Update the router state after a request is routed
- [`on_replica_actor_died()`](../../api/doc/ray.serve.request_router.RequestRouter.on_replica_actor_died.rst): Clean up the state when a replica dies
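
The following skeleton shows the shape of a custom router. It assumes the interface exposed under `ray.serve.request_router` as documented in the linked API reference; verify the exact signatures there before building on it.

```python
import random
from typing import List, Optional

from ray.serve.request_router import (
    PendingRequest,
    RequestRouter,
    RunningReplica,
)


class RandomRequestRouter(RequestRouter):
    """Toy router that picks one replica uniformly at random."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # Return a ranked list of replica groups; Ray Serve tries the groups
        # in order until a replica accepts the request.
        chosen = random.choice(candidate_replicas)
        return [[chosen]]
```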

### Utility mixins

Ray Serve provides mixin classes that add common functionality to routers. See the {ref}`custom-request-router-guide` for examples:

- [`LocalityMixin`](../../api/doc/ray.serve.request_router.LocalityMixin.rst): Prefers replicas on the same node to reduce network latency
- [`MultiplexMixin`](../../api/doc/ray.serve.request_router.MultiplexMixin.rst): Tracks which models are loaded on each replica for LoRA deployments
- [`FIFOMixin`](../../api/doc/ray.serve.request_router.FIFOMixin.rst): Ensures FIFO ordering of requests



### Router lifecycle

The typical lifecycle of request routers includes the following stages:

1. **Initialization**: Router created with list of replicas
2. **Request routing**: `choose_replicas()` called for each request
3. **Callback**: `on_request_routed()` called after successful routing
4. **Replica failure**: `on_replica_actor_died()` called when replica dies
5. **Cleanup**: Router cleaned up when deployment is deleted

#### Async operations

Routers should use async operations for best performance, for example:

```python
# Recommended pattern: Async operation
async def choose_replicas(self, ...):
    state = await self.state_actor.get.remote()
    return self._select(state)

# Not recommended pattern: Blocking operation
async def choose_replicas(self, ...):
    state = ray.get(self.state_actor.get.remote())  # Blocks the event loop!
    return self._select(state)
```

#### State management

For routers with state, use appropriate synchronization, for example:

```python
import asyncio

from ray.serve.request_router import RequestRouter


class StatefulRouter(RequestRouter):
    def __init__(self):
        self.lock = asyncio.Lock()  # For async code
        self.state = {}

    async def choose_replicas(self, ...):
        async with self.lock:  # Protect shared state
            # Update state
            self.state[...] = ...
            return [...]
```

## See also

- {ref}`prefix-aware-routing-guide` - user guide for deploying prefix-aware routing
- {ref}`custom-request-router-guide` - Ray Serve guide for implementing custom routers
- [`RequestRouter` API Reference](../../api/doc/ray.serve.request_router.RequestRouter.rst) - complete API documentation

16 changes: 16 additions & 0 deletions doc/source/serve/llm/examples.md
@@ -0,0 +1,16 @@
# Examples

Production examples for deploying LLMs with Ray Serve.

## Tutorials

Complete end-to-end tutorials for deploying different types of LLMs:

- {doc}`Deploy a small-sized LLM <../tutorials/deployment-serve-llm/small-size-llm/README>`
- {doc}`Deploy a medium-sized LLM <../tutorials/deployment-serve-llm/medium-size-llm/README>`
- {doc}`Deploy a large-sized LLM <../tutorials/deployment-serve-llm/large-size-llm/README>`
- {doc}`Deploy a vision LLM <../tutorials/deployment-serve-llm/vision-llm/README>`
- {doc}`Deploy a reasoning LLM <../tutorials/deployment-serve-llm/reasoning-llm/README>`
- {doc}`Deploy a hybrid reasoning LLM <../tutorials/deployment-serve-llm/hybrid-reasoning-llm/README>`
- {doc}`Deploy gpt-oss <../tutorials/deployment-serve-llm/gpt-oss/README>`

74 changes: 26 additions & 48 deletions doc/source/serve/llm/index.md
@@ -2,70 +2,48 @@

# Serving LLMs

Ray Serve LLM APIs allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.
Ray Serve LLM provides a high-performance, scalable framework for deploying Large Language Models (LLMs) in production. It specializes Ray Serve primitives for distributed LLM serving workloads, offering enterprise-grade features with OpenAI API compatibility.

## Why Ray Serve LLM?

Ray Serve LLM excels at highly distributed multi-node inference workloads:

- **Advanced parallelism strategies**: Seamlessly combine pipeline parallelism, tensor parallelism, expert parallelism, and data parallelism for models of any size.
- **Prefill-decode disaggregation**: Separates and optimizes prefill and decode phases independently for better resource utilization and cost efficiency.
- **Custom request routing**: Implements prefix-aware, session-aware, or custom routing logic to maximize cache hits and reduce latency.
- **Multi-node deployments**: Serves massive models that span multiple nodes with automatic placement and coordination.
- **Production-ready**: Has built-in autoscaling, monitoring, fault tolerance, and observability.

## Features

- ⚡️ Automatic scaling and load balancing
- 🌐 Unified multi-node multi-model deployment
- 🔌 OpenAI compatible
- 🔌 OpenAI-compatible API
- 🔄 Multi-LoRA support with shared base models
- 🚀 Engine agnostic architecture (i.e. vLLM, SGLang, etc)
- 🚀 Engine-agnostic architecture (vLLM, SGLang, etc.)
- 📊 Built-in metrics and Grafana dashboards
- 🎯 Advanced serving patterns (PD disaggregation, data parallelism)

## Requirements

```bash
pip install ray[serve,llm]>=2.43.0 vllm>=0.7.2

# Suggested dependencies when using vllm 0.7.2:
pip install xgrammar==0.1.11 pynvml==12.0.0
pip install ray[serve,llm]
```

## Key Components

The ray.serve.llm module provides two key deployment types for serving LLMs:

### LLMServer

The LLMServer sets up and manages the vLLM engine for model serving. It can be used standalone or combined with your own custom Ray Serve deployments.

### OpenAiIngress

This deployment provides an OpenAI-compatible FastAPI ingress and routes traffic to the appropriate model for multi-model services. The following endpoints are supported:

- `/v1/chat/completions`: Chat interface (ChatGPT-style)
- `/v1/completions`: Text completion
- `/v1/embeddings`: Text embeddings
- `/v1/score`: Text comparison
- `/v1/models`: List available models
- `/v1/models/{model}`: Model information
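
For example, once an app exposing these endpoints is running, you can call it with any OpenAI-compatible client. The base URL, API key, and model name below are illustrative:

```python
from openai import OpenAI

# A placeholder API key is typically fine for a local deployment without auth.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="qwen-0.5b",  # must match a configured model_id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```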

## Configuration

### LLMConfig

The LLMConfig class specifies model details such as:

- Model loading sources (HuggingFace or cloud storage)
- Hardware requirements (accelerator type)
- Engine arguments (e.g. vLLM engine kwargs)
- LoRA multiplexing configuration
- Serve auto-scaling parameters
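
For example, a minimal configuration covering these options might look like the following sketch. The model, accelerator type, and scaling values are illustrative:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # HuggingFace ID or cloud storage path
    ),
    accelerator_type="A10G",
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    engine_kwargs=dict(tensor_parallel_size=1),  # passed through to the engine (vLLM)
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```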

```{toctree}
:hidden:

Quickstart <quick-start>
Prefill/Decode Disaggregation <pd-dissagregation>
Cache-aware request routing <prefix-aware-request-router>
Examples <examples>
User Guides <user-guides/index>
Architecture <architecture/index>
Troubleshooting <troubleshooting>
```

## Examples
## Next steps

- {doc}`Deploy a small-sized LLM <../tutorials/deployment-serve-llm/small-size-llm/README>`
- {doc}`Deploy a medium-sized LLM <../tutorials/deployment-serve-llm/medium-size-llm/README>`
- {doc}`Deploy a large-sized LLM <../tutorials/deployment-serve-llm/large-size-llm/README>`
- {doc}`Deploy a vision LLM <../tutorials/deployment-serve-llm/vision-llm/README>`
- {doc}`Deploy a reasoning LLM <../tutorials/deployment-serve-llm/reasoning-llm/README>`
- {doc}`Deploy a hybrid reasoning LLM <../tutorials/deployment-serve-llm/hybrid-reasoning-llm/README>`
- {doc}`Deploy gpt-oss <../tutorials/deployment-serve-llm/gpt-oss/README>`
- {doc}`Quickstart <quick-start>` - Deploy your first LLM with Ray Serve
- {doc}`Examples <examples>` - Production-ready deployment tutorials
- {doc}`User Guides <user-guides/index>` - Practical guides for advanced features
- {doc}`Architecture <architecture/index>` - Technical design and implementation details
- {doc}`Troubleshooting <troubleshooting>` - Common issues and solutions