
Commit e024c59

Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou/sla_planner_v2
2 parents: a4acd2b + 382e3ae

File tree: 25 files changed, +678 -169 lines


README.md

Lines changed: 2 additions & 4 deletions
```diff
@@ -19,11 +19,9 @@ limitations under the License.
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
 [![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/nvidia-dynamo)
+[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)
 
-| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[User Guides](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Architecture and Features](docs/architecture/architecture.md)** | **[APIs](lib/bindings/python/README.md)** | **[SDK](deploy/dynamo/sdk/README.md)** |
-
-### 📢 **Please join us for our** [ **first Dynamo in-person meetup with vLLM and SGLang leads**](https://events.nvidia.com/nvidiadynamousermeetups) **on 6/5 (Thu) in SF!** ###
-
+| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
 
 ### The Era of Multi-Node, Multi-GPU
```

benchmarks/llm/perf.sh

Lines changed: 65 additions & 13 deletions
```diff
@@ -14,13 +14,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-
-# Parse command line arguments
+# Default Values
 model="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"
 url="http://localhost:8000"
 mode="aggregated"
 artifacts_root_dir="artifacts_root"
 deployment_kind="dynamo"
+concurrency_list="1,2,4,8,16,32,64,128,256"
 
 # Input Sequence Length (isl) 3000 and Output Sequence Length (osl) 150 are
 # selected for chat use case. Note that for other use cases, the results and
@@ -35,49 +35,77 @@ prefill_dp=0
 decode_tp=0
 decode_dp=0
 
+print_help() {
+    echo "Usage: $0 [OPTIONS]"
+    echo
+    echo "Options:"
+    echo "  --tensor-parallelism, --tp <int>                   Tensor parallelism (default: $tp)"
+    echo "  --data-parallelism, --dp <int>                     Data parallelism (default: $dp)"
+    echo "  --prefill-tensor-parallelism, --prefill-tp <int>   Prefill tensor parallelism (default: $prefill_tp)"
+    echo "  --prefill-data-parallelism, --prefill-dp <int>     Prefill data parallelism (default: $prefill_dp)"
+    echo "  --decode-tensor-parallelism, --decode-tp <int>     Decode tensor parallelism (default: $decode_tp)"
+    echo "  --decode-data-parallelism, --decode-dp <int>       Decode data parallelism (default: $decode_dp)"
+    echo "  --model <model_id>                                 Hugging Face model ID to benchmark (default: $model)"
+    echo "  --input-sequence-length, --isl <int>               Input sequence length (default: $isl)"
+    echo "  --output-sequence-length, --osl <int>              Output sequence length (default: $osl)"
+    echo "  --url <http://host:port>                           Target URL for inference requests (default: $url)"
+    echo "  --concurrency <list>                               Comma-separated concurrency levels (default: $concurrency_list)"
+    echo "  --mode <aggregated|disaggregated>                  Serving mode (default: $mode)"
+    echo "  --artifacts-root-dir <path>                        Root directory to store benchmark results (default: $artifacts_root_dir)"
+    echo "  --deployment-kind <type>                           Deployment tag used for pareto chart labels (default: $deployment_kind)"
+    echo "  --help                                             Show this help message and exit"
+    echo
+    exit 0
+}
+
+# Parse command line arguments
 # The defaults can be overridden by command line arguments.
 while [[ $# -gt 0 ]]; do
     case $1 in
-        --tensor-parallelism)
+        --tensor-parallelism|--tp)
             tp="$2"
             shift 2
             ;;
-        --data-parallelism)
+        --data-parallelism|--dp)
             dp="$2"
             shift 2
             ;;
-        --prefill-tensor-parallelism)
+        --prefill-tensor-parallelism|--prefill-tp)
             prefill_tp="$2"
             shift 2
             ;;
-        --prefill-data-parallelism)
+        --prefill-data-parallelism|--prefill-dp)
             prefill_dp="$2"
             shift 2
             ;;
-        --decode-tensor-parallelism)
+        --decode-tensor-parallelism|--decode-tp)
             decode_tp="$2"
             shift 2
             ;;
-        --decode-data-parallelism)
+        --decode-data-parallelism|--decode-dp)
             decode_dp="$2"
             shift 2
             ;;
-        --model)
+        --model)
             model="$2"
             shift 2
             ;;
-        --input-sequence-length)
+        --input-sequence-length|--isl)
             isl="$2"
             shift 2
             ;;
-        --output-sequence-length)
+        --output-sequence-length|--osl)
             osl="$2"
             shift 2
             ;;
         --url)
             url="$2"
             shift 2
             ;;
+        --concurrency)
+            concurrency_list="$2"
+            shift 2
+            ;;
         --mode)
             mode="$2"
             shift 2
@@ -90,13 +118,30 @@ while [[ $# -gt 0 ]]; do
             deployment_kind="$2"
             shift 2
             ;;
+        --help)
+            print_help
+            ;;
         *)
             echo "Unknown option: $1"
             exit 1
             ;;
     esac
 done
 
+# Function to validate if concurrency values are positive integers
+validate_concurrency() {
+    for val in "${concurrency_array[@]}"; do
+        if ! [[ "$val" =~ ^[0-9]+$ ]] || [ "$val" -le 0 ]; then
+            echo "Error: Invalid concurrency value '$val'. Must be a positive integer." >&2
+            exit 1
+        fi
+    done
+}
+
+IFS=',' read -r -a concurrency_array <<< "$concurrency_list"
+# Validate concurrency values
+validate_concurrency
+
 if [ "${mode}" == "aggregated" ]; then
     if [ "${tp}" == "0" ] && [ "${dp}" == "0" ]; then
         echo "--tensor-parallelism and --data-parallelism must be set for aggregated mode."
@@ -157,8 +202,15 @@ if [ $index -gt 0 ]; then
     echo "--------------------------------"
 fi
 
+echo "Running genai-perf with:"
+echo "Model: $model"
+echo "ISL: $isl"
+echo "OSL: $osl"
+echo "Concurrency levels: ${concurrency_array[@]}"
+
 # Concurrency levels to test
-for concurrency in 1 2 4 8 16 32 64 128 256; do
+for concurrency in "${concurrency_array[@]}"; do
+  echo "Run concurrency: $concurrency"
 
   # NOTE: For Dynamo HTTP OpenAI frontend, use `nvext` for fields like
   # `ignore_eos` since they are not in the official OpenAI spec.
@@ -185,7 +237,7 @@ for concurrency in 1 2 4 8 16 32 64 128 256; do
     --artifact-dir ${artifact_dir} \
     -- \
     -v \
-    --max-threads 256 \
+    --max-threads ${concurrency} \
     -H 'Authorization: Bearer NOT USED' \
     -H 'Accept: text/event-stream'
 
```
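For reference, a hedged example of invoking the updated script; the values below are illustrative (the model and URL simply fall back to the script's defaults), and `--tp`/`--dp` must be non-zero in aggregated mode per the validation above:

```bash
# Illustrative only: sweep three concurrency levels instead of the default nine.
# Assumes an aggregated deployment is already serving at the default URL.
./benchmarks/llm/perf.sh \
    --mode aggregated \
    --tp 4 --dp 1 \
    --isl 3000 --osl 150 \
    --concurrency 1,8,64

# Invalid values now fail fast before any benchmark runs, e.g. this exits with:
# "Error: Invalid concurrency value 'abc'. Must be a positive integer."
./benchmarks/llm/perf.sh --tp 4 --dp 1 --concurrency 1,abc,64
```

Note that `--max-threads` now tracks each concurrency level rather than being pinned at 256, so low-concurrency runs no longer spawn more load-generator threads than they can use.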

docs/API/sdk.md

Lines changed: 1 addition & 9 deletions
```diff
@@ -17,17 +17,9 @@ limitations under the License.
 
 # Dynamo SDK
 
-# Table of Contents
-
-- [Introduction](#introduction)
-- [Installation](#installation)
-- [Core Concepts](#core-concepts)
-- [Writing a Service](#writing-a-service)
-- [Configuring a Service](#configuring-a-service)
-- [Composing Services into an Graph](#composing-services-into-an-graph)
 ## Introduction
 
-Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See Python Bindings](./python_bindings.md).
+Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
 
 Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
 
```
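As a rough sketch of the local workflow that paragraph describes, the commands below are assumptions to verify against the linked hello-world example, which is the authoritative reference; the `hello_world:Frontend` target, port, and payload are hypothetical placeholders:

```bash
# Hypothetical: serve the toy hello-world graph locally with the Dynamo CLI.
# The module:Service target name is an assumption; see the linked example.
dynamo serve hello_world:Frontend

# Then exercise the locally running pipeline (endpoint and payload illustrative).
curl -X POST http://localhost:8000/generate -d '{"text": "hello"}'
```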

docs/architecture/architecture.md

Lines changed: 6 additions & 6 deletions
```diff
@@ -20,13 +20,13 @@ limitations under the License.
 
 Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
 
-- **Disaggregated prefill & decode inference** Maximizes GPU throughput and helps you balance throughput and latency
-- **Dynamic GPU scheduling** Optimizes performance based on real-time demand
-- **LLM-aware request routing** Eliminates unnecessary KV cache recomputation
-- **Accelerated data transfer** Reduces inference response time using NIXL
-- **KV cache offloading**Uses multiple memory hierarchies for higher system throughput
+- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
+- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
+- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
+- **Accelerated data transfer**: Reduces inference response time using NIXL
+- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
 
-Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach
+Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach
 
 ## Motivation behind Dynamo
 
```

docs/architecture/distributed_runtime.md

Lines changed: 5 additions & 5 deletions
````diff
@@ -61,11 +61,11 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
 
 Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject, of the available `Endpoint`s.
 
-The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_routers.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
+The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
 
-- `random`: randomly select an endpoint to hit,
-- `round_robin`: select endpoints in round-robin order,
-- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint.
+- `random`: randomly select an endpoint to hit
+- `round_robin`: select endpoints in round-robin order
+- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
 
 After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
 
@@ -77,7 +77,7 @@ We provide native rust and python (through binding) examples for basic usage of
 - Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vllm-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
 
 ```{note}
-Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues to require exgtensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
+Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
 
 You can tune the number of parallel build jobs for building VLLM from source
 on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
````
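As a side note on the etcd layout this doc describes, a hedged way to inspect the registration prefix from the command line, assuming `etcdctl` is pointed at the same etcd cluster the runtime uses; the `my-namespace/my-component/generate` path is a hypothetical placeholder, not a value from the repo:

```bash
# List whatever is currently registered under a hypothetical endpoint prefix.
etcdctl get --prefix /services/my-namespace/my-component/generate

# Follow registrations and deregistrations live -- roughly the events the
# Client's etcd watcher reacts to as workers come and go.
etcdctl watch --prefix /services/my-namespace/my-component/generate
```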
