
Commit e024c59

Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou/sla_planner_v2
2 parents: a4acd2b + 382e3ae

File tree: 25 files changed, +678 -169 lines


README.md

Lines changed: 2 additions & 4 deletions
```diff
@@ -19,11 +19,9 @@ limitations under the License.
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
 [![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/nvidia-dynamo)
+[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)
 
-| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[User Guides](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Architecture and Features](docs/architecture/architecture.md)** | **[APIs](lib/bindings/python/README.md)** | **[SDK](deploy/dynamo/sdk/README.md)** |
-
-### 📢 **Please join us for our** [ **first Dynamo in-person meetup with vLLM and SGLang leads**](https://events.nvidia.com/nvidiadynamousermeetups) **on 6/5 (Thu) in SF!** ###
-
+| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
 
 ### The Era of Multi-Node, Multi-GPU
```

benchmarks/llm/perf.sh

Lines changed: 65 additions & 13 deletions
```diff
@@ -14,13 +14,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-
-# Parse command line arguments
+# Default Values
 model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"
 url="http://localhost:8000"
 mode="aggregated"
 artifacts_root_dir="artifacts_root"
 deployment_kind="dynamo"
+concurrency_list="1,2,4,8,16,32,64,128,256"
 
 # Input Sequence Length (isl) 3000 and Output Sequence Length (osl) 150 are
 # selected for chat use case. Note that for other use cases, the results and
@@ -35,49 +35,77 @@ prefill_dp=0
 decode_tp=0
 decode_dp=0
 
+print_help() {
+    echo "Usage: $0 [OPTIONS]"
+    echo
+    echo "Options:"
+    echo "  --tensor-parallelism, --tp <int>                   Tensor parallelism (default: $tp)"
+    echo "  --data-parallelism, --dp <int>                     Data parallelism (default: $dp)"
+    echo "  --prefill-tensor-parallelism, --prefill-tp <int>   Prefill tensor parallelism (default: $prefill_tp)"
+    echo "  --prefill-data-parallelism, --prefill-dp <int>     Prefill data parallelism (default: $prefill_dp)"
+    echo "  --decode-tensor-parallelism, --decode-tp <int>     Decode tensor parallelism (default: $decode_tp)"
+    echo "  --decode-data-parallelism, --decode-dp <int>       Decode data parallelism (default: $decode_dp)"
+    echo "  --model <model_id>                                 Hugging Face model ID to benchmark (default: $model)"
+    echo "  --input-sequence-length, --isl <int>               Input sequence length (default: $isl)"
+    echo "  --output-sequence-length, --osl <int>              Output sequence length (default: $osl)"
+    echo "  --url <http://host:port>                           Target URL for inference requests (default: $url)"
+    echo "  --concurrency <list>                               Comma-separated concurrency levels (default: $concurrency_list)"
+    echo "  --mode <aggregated|disaggregated>                  Serving mode (default: $mode)"
+    echo "  --artifacts-root-dir <path>                        Root directory to store benchmark results (default: $artifacts_root_dir)"
+    echo "  --deployment-kind <type>                           Deployment tag used for pareto chart labels (default: $deployment_kind)"
+    echo "  --help                                             Show this help message and exit"
+    echo
+    exit 0
+}
+
+# Parse command line arguments
 # The defaults can be overridden by command line arguments.
 while [[ $# -gt 0 ]]; do
     case $1 in
-        --tensor-parallelism)
+        --tensor-parallelism|--tp)
             tp="$2"
             shift 2
             ;;
-        --data-parallelism)
+        --data-parallelism|--dp)
             dp="$2"
             shift 2
             ;;
-        --prefill-tensor-parallelism)
+        --prefill-tensor-parallelism|--prefill-tp)
             prefill_tp="$2"
             shift 2
             ;;
-        --prefill-data-parallelism)
+        --prefill-data-parallelism|--prefill-dp)
             prefill_dp="$2"
             shift 2
             ;;
-        --decode-tensor-parallelism)
+        --decode-tensor-parallelism|--decode-tp)
             decode_tp="$2"
             shift 2
             ;;
-        --decode-data-parallelism)
+        --decode-data-parallelism|--decode-dp)
             decode_dp="$2"
             shift 2
             ;;
-        --model)
+        --model)
             model="$2"
             shift 2
             ;;
-        --input-sequence-length)
+        --input-sequence-length|--isl)
             isl="$2"
             shift 2
             ;;
-        --output-sequence-length)
+        --output-sequence-length|--osl)
             osl="$2"
             shift 2
             ;;
         --url)
             url="$2"
             shift 2
             ;;
+        --concurrency)
+            concurrency_list="$2"
+            shift 2
+            ;;
         --mode)
             mode="$2"
             shift 2
@@ -90,13 +118,30 @@ while [[ $# -gt 0 ]]; do
             deployment_kind="$2"
             shift 2
             ;;
+        --help)
+            print_help
+            ;;
         *)
             echo "Unknown option: $1"
             exit 1
             ;;
     esac
 done
 
+# Function to validate if concurrency values are positive integers
+validate_concurrency() {
+    for val in "${concurrency_array[@]}"; do
+        if ! [[ "$val" =~ ^[0-9]+$ ]] || [ "$val" -le 0 ]; then
+            echo "Error: Invalid concurrency value '$val'. Must be a positive integer." >&2
+            exit 1
+        fi
+    done
+}
+
+IFS=',' read -r -a concurrency_array <<< "$concurrency_list"
+# Validate concurrency values
+validate_concurrency
+
 if [ "${mode}" == "aggregated" ]; then
     if [ "${tp}" == "0" ] && [ "${dp}" == "0" ]; then
         echo "--tensor-parallelism and --data-parallelism must be set for aggregated mode."
@@ -157,8 +202,15 @@ if [ $index -gt 0 ]; then
     echo "--------------------------------"
 fi
 
+echo "Running genai-perf with:"
+echo "Model: $model"
+echo "ISL: $isl"
+echo "OSL: $osl"
+echo "Concurrency levels: ${concurrency_array[@]}"
+
 # Concurrency levels to test
-for concurrency in 1 2 4 8 16 32 64 128 256; do
+for concurrency in "${concurrency_array[@]}"; do
+  echo "Run concurrency: $concurrency"
 
   # NOTE: For Dynamo HTTP OpenAI frontend, use `nvext` for fields like
   # `ignore_eos` since they are not in the official OpenAI spec.
@@ -185,7 +237,7 @@ for concurrency in 1 2 4 8 16 32 64 128 256; do
     --artifact-dir ${artifact_dir} \
     -- \
     -v \
-    --max-threads 256 \
+    --max-threads ${concurrency} \
     -H 'Authorization: Bearer NOT USED' \
     -H 'Accept: text/event-stream'
 
```
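For reference, a hedged example of invoking the updated script; the values below are illustrative (the model and URL simply fall back to the script's defaults), and `--tp`/`--dp` must be non-zero in aggregated mode per the validation above:

```bash
# Illustrative only: sweep three concurrency levels instead of the default nine.
# Assumes an aggregated deployment is already serving at the default URL.
./benchmarks/llm/perf.sh \
    --mode aggregated \
    --tp 4 --dp 1 \
    --isl 3000 --osl 150 \
    --concurrency 1,8,64

# Invalid values now fail fast before any benchmark runs, e.g. this exits with:
# "Error: Invalid concurrency value 'abc'. Must be a positive integer."
./benchmarks/llm/perf.sh --tp 4 --dp 1 --concurrency 1,abc,64
```

Note that `--max-threads` now tracks each concurrency level rather than being pinned at 256, so low-concurrency runs no longer spawn more load-generator threads than they can use.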

docs/API/sdk.md

Lines changed: 1 addition & 9 deletions
```diff
@@ -17,17 +17,9 @@ limitations under the License.
 
 # Dynamo SDK
 
-# Table of Contents
-
-- [Introduction](#introduction)
-- [Installation](#installation)
-- [Core Concepts](#core-concepts)
-- [Writing a Service](#writing-a-service)
-- [Configuring a Service](#configuring-a-service)
-- [Composing Services into an Graph](#composing-services-into-an-graph)
 ## Introduction
 
-Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See Python Bindings](./python_bindings.md).
+Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
 
 Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
 
```
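As a rough sketch of the local workflow that paragraph describes, the commands below are assumptions to verify against the linked hello-world example, which is the authoritative reference; the `hello_world:Frontend` target, port, and payload are hypothetical placeholders:

```bash
# Hypothetical: serve the toy hello-world graph locally with the Dynamo CLI.
# The module:Service target name is an assumption; see the linked example.
dynamo serve hello_world:Frontend

# Then exercise the locally running pipeline (endpoint and payload illustrative).
curl -X POST http://localhost:8000/generate -d '{"text": "hello"}'
```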

docs/architecture/architecture.md

Lines changed: 6 additions & 6 deletions
```diff
@@ -20,13 +20,13 @@ limitations under the License.
 
 Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
 
-- **Disaggregated prefill & decode inference** Maximizes GPU throughput and helps you balance throughput and latency
-- **Dynamic GPU scheduling** Optimizes performance based on real-time demand
-- **LLM-aware request routing** Eliminates unnecessary KV cache recomputation
-- **Accelerated data transfer** Reduces inference response time using NIXL
-- **KV cache offloading**Uses multiple memory hierarchies for higher system throughput
+- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
+- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
+- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
+- **Accelerated data transfer**: Reduces inference response time using NIXL
+- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
 
-Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach
+Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach
 
 ## Motivation behind Dynamo
 
```

docs/architecture/distributed_runtime.md

Lines changed: 5 additions & 5 deletions
````diff
@@ -61,11 +61,11 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
 
 Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject, of the available `Endpoint`s.
 
-The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_routers.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
+The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
 
-- `random`: randomly select an endpoint to hit,
-- `round_robin`: select endpoints in round-robin order,
-- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint.
+- `random`: randomly select an endpoint to hit
+- `round_robin`: select endpoints in round-robin order
+- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
 
 After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
 
@@ -77,7 +77,7 @@ We provide native rust and python (through binding) examples for basic usage of
 - Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vllm-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
 
 ```{note}
-Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues to require exgtensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
+Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
 
 You can tune the number of parallel build jobs for building VLLM from source
 on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
````
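As a side note on the etcd layout this doc describes, a hedged way to inspect the registration prefix from the command line, assuming `etcdctl` is pointed at the same etcd cluster the runtime uses; the `my-namespace/my-component/generate` path is a hypothetical placeholder, not a value from the repo:

```bash
# List whatever is currently registered under a hypothetical endpoint prefix.
etcdctl get --prefix /services/my-namespace/my-component/generate

# Follow registrations and deregistrations live -- roughly the events the
# Client's etcd watcher reacts to as workers come and go.
etcdctl watch --prefix /services/my-namespace/my-component/generate
```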
