components/backends/trtllm/README.md

<!--
...
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregated or disaggregated serving.

## Single Node Examples

> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers. You can easily take each command and run it in a separate terminal.
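
For example, launching a configuration manually looks like the following (a sketch; replace `<args>` with the same model and engine arguments used by the scripts, which are elided here):

```bash
# Terminal 1: start the HTTP ingress (frontend)
python3 -m dynamo.frontend <args>

# Terminal 2: start a TensorRT-LLM worker
python3 -m dynamo.trtllm <args>
```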
This figure shows an overview of the major components to deploy:

```
...
   |      +----------|  kv-router   |
   +---------------->|              |
                     +--------------+
```
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.

### Aggregated

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg.sh
```

### Aggregated with KV Routing

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg_router.sh
```

### Disaggregated

> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg.sh
```
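
For example, to switch a run to the `"prefill_first"` strategy (a sketch, assuming the launch script reads the variable as described above):

```bash
# override the default "decode_first" strategy for this run
DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
```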

### Disaggregated with KV Routing

> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In a disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg_router.sh
```
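
Similarly, to use a "decode first" workflow with KV routing (again assuming the script picks the variable up):

```bash
DISAGGREGATION_STRATEGY="decode_first" ./launch/disagg_router.sh
```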

### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

```bash
cd $DYNAMO_HOME/components/backends/trtllm
...
```

Notes:

- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
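
For example, a warm-up request could look like the following (a sketch assuming the frontend exposes the default OpenAI-compatible endpoint on port 8000; the model name is a placeholder):

```bash
# warm-up request; ignore_eos is deliberately omitted so generation stops
# naturally, keeping MTP acceptance-rate measurements realistic
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Warm-up request"}],
        "max_tokens": 64
      }'
```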

## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.

... indicates a request to this model may be migrated up to 3 times to another Backend. The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for an enhanced user experience.
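
As a sketch, the migration limit is configured when launching a worker; the flag below is assumed from the description above and is not confirmed by this diff:

```bash
# hypothetical flag name; allows up to 3 migrations per request
python3 -m dynamo.trtllm <args> --migration-limit=3
```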

## Client

See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.
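
As a quick smoke test, you can also send a request directly to the frontend (a sketch assuming the default OpenAI-compatible endpoint on port 8000; the model name is a placeholder):

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```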
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

components/backends/vllm/README.md

This directory contains a Dynamo vLLM engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.

## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes. You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
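
For example, using plain git to check out the most recent tag:

```bash
# resolve the newest tag and check it out
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```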

...

| **WideEP**          | ✅ | Support for PPLX / DeepEP not verified      |
| **DP Rank Routing** | ✅ | Supported via external control of DP ranks |
| **GB200 Support**   | 🚧 | Container functional on main                |
## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start them using [Docker Compose](../../../deploy/docker-compose.yml):

```bash
docker compose -f deploy/docker-compose.yml up -d
```

### Pull or build container

We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
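
The exact build command is elided here; one possible invocation, assuming the repository ships a `container/build.sh` helper (an assumption, not confirmed by this diff):

```bash
# from the repository root; script name and flag are assumptions
./container/build.sh --framework VLLM
```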
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.

## Run Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.

This figure shows an overview of the major components to deploy:

...

Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.

### Aggregated Serving

```bash
# requires one gpu
cd components/backends/vllm
bash launch/agg.sh
```

### Aggregated Serving with KV Routing

```bash
# requires two gpus
cd components/backends/vllm
bash launch/agg_router.sh
```

### Disaggregated Serving

```bash
# requires two gpus
cd components/backends/vllm
bash launch/disagg.sh
```

### Disaggregated Serving with KV Routing

```bash
# requires three gpus
cd components/backends/vllm
bash launch/disagg_router.sh
```

### Single Node Data Parallel Attention / Expert Parallelism

This example is not meant to be performant but showcases Dynamo routing to data parallel workers.

```bash
# requires four gpus
cd components/backends/vllm
bash launch/dep.sh
```

> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
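
For example, while `launch/disagg.sh` is running, you can start an additional prefill worker in a new terminal (a sketch; reuse the worker arguments from the launch script, elided here as `<args>`):

```bash
cd components/backends/vllm
# one more worker; the frontend discovers and utilizes it automatically
python3 -m dynamo.vllm <args>
```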

## Advanced Examples

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

...

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
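
A sketch with placeholder names (registry, image name, and tag are all hypothetical):

```bash
# hypothetical registry/image/tag; substitute your own
docker build -t <your-registry>/vllm-runtime:my-tag .
docker push <your-registry>/vllm-runtime:my-tag
```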