Commit 5be23eb

Readmes + eks additions (#2157)

1 parent a8cb655 commit 5be23eb

File tree

7 files changed: +432 -99 lines changed

components/backends/trtllm/README.md

Lines changed: 62 additions & 58 deletions
@@ -15,29 +15,10 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 
-# LLM Deployment Examples using TensorRT-LLM
+# LLM Deployment using TensorRT-LLM
 
 This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
 
-# User Documentation
-
-- [Deployment Architectures](#deployment-architectures)
-- [Getting Started](#getting-started)
-- [Prerequisites](#prerequisites)
-- [Build docker](#build-docker)
-- [Run container](#run-container)
-- [Run deployment](#run-deployment)
-- [Single Node deployment](#single-node-deployments)
-- [Multinode deployment](#multinode-deployment)
-- [Client](#client)
-- [Benchmarking](#benchmarking)
-- [Disaggregation Strategy](#disaggregation-strategy)
-- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
-- [More Example Architectures](#more-example-architectures)
-- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)
-
-# Quick Start
-
 ## Use the Latest Release
 
 We recommend using the latest stable release of dynamo to avoid breaking changes:
@@ -50,26 +31,52 @@ You can find the latest release [here](https://github.com/ai-dynamo/dynamo/relea
 git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ```
 
-## Deployment Architectures
+---
+
+## Table of Contents
+- [Feature Support Matrix](#feature-support-matrix)
+- [Quick Start](#quick-start)
+- [Single Node Examples](#single-node-deployments)
+- [Advanced Examples](#advanced-examples)
+- [Disaggregation Strategy](#disaggregation-strategy)
+- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
+- [Client](#client)
+- [Benchmarking](#benchmarking)
+
+## Feature Support Matrix
+
+### Core Dynamo Features
 
-See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
+| Feature | TensorRT-LLM | Notes |
+|---------|--------------|-------|
+| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
+| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
+| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
+| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | 🚧 | Planned |
+| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | Planned |
+| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
 
-Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.
+### Large Scale P/D and WideEP Features
 
-## Getting Started
+| Feature | TensorRT-LLM | Notes |
+|--------------------|--------------|-----------------------------------------------------------------------|
+| **WideEP** | ✅ | |
+| **DP Rank Routing** | ✅ | |
+| **GB200 Support** | ✅ | |
 
-1. Choose a deployment architecture based on your requirements
-2. Configure the components as needed
-3. Deploy using the provided scripts
+## Quick Start
 
-### Prerequisites
+Below we provide a guide that lets you run all of the common deployment patterns on a single node.
+
+### Start NATS and ETCD in the background
+
+Start them using [Docker Compose](../../../deploy/docker-compose.yml):
 
-Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml)
 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```
 
-### Build docker
+### Build container
 
 ```bash
 # TensorRT-LLM uses git-lfs, which needs to be installed in advance.
@@ -89,17 +96,18 @@ apt-get update && apt-get -y install git git-lfs
 
 ### Run container
 
-```
+```bash
 ./container/run.sh --framework tensorrtllm -it
 ```
-## Run Deployment
 
-This figure shows an overview of the major components to deploy:
+## Single Node Examples
 
+> [!IMPORTANT]
+> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers. You can easily take each command and run it in a separate terminal.
 
+This figure shows an overview of the major components to deploy:
 
 ```
-
 +------+      +-----------+      +------------------+             +---------------+
 | HTTP |----->| processor |----->|      Worker1     |------------>|    Worker2    |
 |      |<-----|           |<-----|                  |<------------|               |
@@ -111,29 +119,23 @@ This figure shows an overview of the major components to deploy:
 | +---------| kv-router |
 +------------->| |
 +------------------+
-
 ```
 
 **Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
 
-### Single-Node Deployments
-
-> [!IMPORTANT]
-> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `dynamo-run` to start up the ingress and using `python3` to start up the workers. You can easily take each command and run them in separate terminals.
-
-#### Aggregated
+### Aggregated
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 ./launch/agg.sh
 ```
 
-#### Aggregated with KV Routing
+### Aggregated with KV Routing
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 ./launch/agg_router.sh
 ```
 
-#### Disaggregated
+### Disaggregated
 
 > [!IMPORTANT]
 > Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
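The `DISAGGREGATION_STRATEGY` switch described in the note above follows the common shell default-override pattern; below is a minimal sketch. The variable name and the `decode_first` default come from the note itself, while the way the launch script consumes the variable internally is an assumption.

```shell
# Default to "decode_first" when the caller has not set the variable,
# mirroring the behavior described in the note above.
DISAGGREGATION_STRATEGY="${DISAGGREGATION_STRATEGY:-decode_first}"
echo "strategy: ${DISAGGREGATION_STRATEGY}"

# To switch strategies for a single run (illustrative):
#   DISAGGREGATION_STRATEGY=prefill_first ./launch/disagg.sh
```

With no override in the environment, this prints `strategy: decode_first`.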
@@ -143,7 +145,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
 ./launch/disagg.sh
 ```
 
-#### Disaggregated with KV Routing
+### Disaggregated with KV Routing
 
 > [!IMPORTANT]
 > Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
@@ -153,7 +155,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
 ./launch/disagg_router.sh
 ```
 
-#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
+### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
 ```bash
 cd $DYNAMO_HOME/components/backends/trtllm
 
@@ -172,21 +174,16 @@ Notes:
 - There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
 - MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
 
-### Multinode Deployment
-
-For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
-
-### Client
-
-See [client](../llm/README.md#client) section to learn how to send request to the deployment.
+## Advanced Examples
 
-NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
+Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
 
-### Benchmarking
+### Multinode Deployment
 
-To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
+For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
 
+### Speculative Decoding
+- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
 
 ## Disaggregation Strategy
 
@@ -221,6 +218,13 @@ indicates a request to this model may be migrated up to 3 times to another Backe
 
 The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
 
-## More Example Architectures
+## Client
+
+See the [client](../llm/README.md#client) section to learn how to send a request to the deployment.
+
+NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
 
-- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)
+## Benchmarking
+
+To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
+`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
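Alongside the `## Client` section added in this commit, a typical request to the frontend looks like the sketch below. The port, model name, and payload fields are illustrative assumptions, not values taken from this commit.

```shell
# Build an OpenAI-style chat payload; the model name is a placeholder.
PAYLOAD='{"model": "your-model-name", "messages": [{"role": "user", "content": "Why is the sky blue?"}], "max_tokens": 64}'

# Validate the JSON locally before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Then POST it to the node running the frontend (port 8000 assumed):
#   curl -s http://localhost:8000/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "$PAYLOAD"
```

The validation step prints `payload ok` when the JSON parses cleanly.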

components/backends/vllm/README.md

Lines changed: 66 additions & 20 deletions
@@ -7,33 +7,81 @@ SPDX-License-Identifier: Apache-2.0
 
 This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
 
-## Deployment Architectures
+## Use the Latest Release
 
-See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns.
+We recommend using the latest stable release of Dynamo to avoid breaking changes:
 
-## Getting Started
+[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
 
-### Prerequisites
+You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
 
-Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml):
+```bash
+git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
+```
+
+---
+
+## Table of Contents
+- [Feature Support Matrix](#feature-support-matrix)
+- [Quick Start](#quick-start)
+- [Single Node Examples](#run-single-node-examples)
+- [Advanced Examples](#advanced-examples)
+- [Deploy on Kubernetes](#kubernetes-deployment)
+- [Configuration](#configuration)
+
+## Feature Support Matrix
+
+### Core Dynamo Features
+
+| Feature | vLLM | Notes |
+|---------|------|-------|
+| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
+| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
+| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
+| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ✅ | |
+| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | WIP |
+| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |
+
+### Large Scale P/D and WideEP Features
+
+| Feature | vLLM | Notes |
+|--------------------|------|-----------------------------------------------------------------------|
+| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
+| **DP Rank Routing** | ✅ | Supported via external control of DP ranks |
+| **GB200 Support** | 🚧 | Container functional on main |
+
+## Quick Start
+
+Below we provide a guide that lets you run all of the common deployment patterns on a single node.
+
+### Start NATS and ETCD in the background
+
+Start them using [Docker Compose](../../../deploy/docker-compose.yml):
 
 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```
 
-### Build and Run docker
+### Pull or build container
+
+We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
 
 ```bash
 ./container/build.sh --framework VLLM
 ```
 
+### Run container
+
 ```bash
 ./container/run.sh -it --framework VLLM [--mount-workspace]
 ```
 
 This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
 
-## Run Deployment
+## Run Single Node Examples
+
+> [!IMPORTANT]
+> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
 
 This figure shows an overview of the major components to deploy:
 
@@ -53,57 +101,55 @@ This figure shows an overview of the major components to deploy:
 
 Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
 
-### Example Architectures
-
-> [!IMPORTANT]
-> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `dynamo run` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can run each command in separate terminals for better log visibility.
-
-#### Aggregated Serving
+### Aggregated Serving
 
 ```bash
 # requires one gpu
 cd components/backends/vllm
 bash launch/agg.sh
 ```
 
-#### Aggregated Serving with KV Routing
+### Aggregated Serving with KV Routing
 
 ```bash
 # requires two gpus
 cd components/backends/vllm
 bash launch/agg_router.sh
 ```
 
-#### Disaggregated Serving
+### Disaggregated Serving
 
 ```bash
 # requires two gpus
 cd components/backends/vllm
 bash launch/disagg.sh
 ```
 
-#### Disaggregated Serving with KV Routing
+### Disaggregated Serving with KV Routing
 
 ```bash
 # requires three gpus
 cd components/backends/vllm
 bash launch/disagg_router.sh
 ```
 
-#### Single Node Data Parallel Attention / Expert Parallelism
+### Single Node Data Parallel Attention / Expert Parallelism
 
-This example is not meant to be performant but showcases dynamo routing to data parallel workers
+This example is not meant to be performant but showcases Dynamo routing to data parallel workers.
 
 ```bash
 # requires four gpus
 cd components/backends/vllm
 bash launch/dep.sh
 ```
 
-
 > [!TIP]
 > Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
 
+## Advanced Examples
+
+Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
+
 ### Kubernetes Deployment
 
 For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
@@ -118,7 +164,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
 
 - **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
 
-- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
+- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
 ```bash
 ./container/build.sh --framework VLLM
 # Tag and push to your container registry