Merged

40 commits
- `d5cc512` fix trt-llm readme per ryan's feedback on attention dp (athreesh, Aug 4, 2025)
- `be7797f` adding in docs reference for metrics (athreesh, Aug 4, 2025)
- `11146a8` fix all readmes + added in quickstart commands (athreesh, Aug 5, 2025)
- `8b1e3e6` update --use-kv-events to --kv-events flag (athreesh, Aug 5, 2025)
- `fad6124` updating planner doc (athreesh, Aug 5, 2025)
- `d3dc695` fixing merge conflict (athreesh, Aug 5, 2025)
- `34d95ee` examples update to fix merge conflict (athreesh, Aug 5, 2025)
- `05da6ed` add sglang doc (athreesh, Aug 5, 2025)
- `f49f037` adding api/nixl topics to hidden_toc.rst (kmkelle-nv, Aug 5, 2025)
- `19b282f` small change so biswa can start (athreesh, Aug 5, 2025)
- `404b8c9` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 5, 2025)
- `6220de1` lot of trt-llm readme link fixes (athreesh, Aug 5, 2025)
- `3f9f892` trt-llm links fixed (athreesh, Aug 5, 2025)
- `aabd7c9` fixing 1 example link (athreesh, Aug 5, 2025)
- `4a3f2b0` vllm link fixes (athreesh, Aug 5, 2025)
- `45a36a7` fix toctree (athreesh, Aug 6, 2025)
- `e5914cc` fix: sglang (biswapanda, Aug 6, 2025)
- `9d41014` docs: add helm install guide and sglang deploy README (athreesh, Aug 6, 2025)
- `17f3539` docs: fix examples links (nealvaidya, Aug 6, 2025)
- `dbc0192` fixes to merge conflict (athreesh, Aug 6, 2025)
- `c6eef3e` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 6, 2025)
- `6227d22` fixed links more (athreesh, Aug 6, 2025)
- `5809a1b` fixed vllm links, toctree left (athreesh, Aug 6, 2025)
- `50d6fa3` fixing the runtime link ref (athreesh, Aug 6, 2025)
- `5a7baa0` mv multinode-examples copy to multinode-examples (biswapanda, Aug 6, 2025)
- `8891bd6` docs: fix up toctress (nealvaidya, Aug 6, 2025)
- `e12ced6` fix quickstart instructions (athreesh, Aug 6, 2025)
- `2d72ec5` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 6, 2025)
- `3eb3974` sgl references (athreesh, Aug 6, 2025)
- `b50c7c3` sgl references fix (athreesh, Aug 6, 2025)
- `b6482ed` docs: fix index.rst (nealvaidya, Aug 6, 2025)
- `e1916d8` make hidden toctree orphan (nealvaidya, Aug 6, 2025)
- `fb199c9` adjusted index.rst (athreesh, Aug 6, 2025)
- `e4cf69f` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 6, 2025)
- `022aaf2` Resolve merge conflict: remove duplicate line in dynamo_deploy README (athreesh, Aug 6, 2025)
- `359a256` Merge branch 'release/0.4.0' into backport/anish-index-rst-into-0.4.0 (athreesh, Aug 6, 2025)
- `a0c6cdb` fix another link lol (athreesh, Aug 6, 2025)
- `ce174f7` doc rendering fixes (nealvaidya, Aug 6, 2025)
- `05db79c` fix whitespace issues (nealvaidya, Aug 6, 2025)
- `d7b2271` add missing triple tildes (biswapanda, Aug 6, 2025)
10 changes: 5 additions & 5 deletions components/backends/sglang/README.md
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

### Large Scale P/D and WideEP Features

| Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅ | Full support on H100s/GB200 |
| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP workers |
| **GB200 Support** | ✅ | |


## Quick Start
109 changes: 1 addition & 108 deletions components/backends/sglang/slurm_jobs/README.md
@@ -1,108 +1 @@
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks (an illustrative sketch of its sampling loop follows this list)

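A script like `monitor_gpu_utilization.sh` typically samples `nvidia-smi` on a fixed interval. Below is a minimal Python sketch of that pattern; the query fields, the 5-second interval, and the output format are assumptions for illustration, not the actual script's behavior:

```python
# Minimal sketch of a GPU-utilization sampling loop (illustrative only).
# The actual monitor_gpu_utilization.sh is a shell script; the query fields
# and interval here are assumptions.
import subprocess
import time

def sample_gpu_utilization(interval_s: float = 5.0) -> None:
    """Print one CSV utilization sample per GPU every interval_s seconds."""
    while True:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=timestamp,index,utilization.gpu,memory.used",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        print(result.stdout.strip(), flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    sample_gpu_utilization()
```
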
## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
│ ├── log.err # Main job errors
│ ├── node0197_prefill.out # Prefill node stdout (node0197)
│ ├── node0197_prefill.err # Prefill node stderr (node0197)
│ ├── node0200_prefill.out # Prefill node stdout (node0200)
│ ├── node0200_prefill.err # Prefill node stderr (node0200)
│ ├── node0201_decode.out # Decode node stdout (node0201)
│ ├── node0201_decode.err # Decode node stderr (node0201)
│ ├── node0204_decode.out # Decode node stdout (node0204)
│ ├── node0204_decode.err # Decode node stderr (node0204)
│ ├── node0197_prefill_gpu_utilization.log # GPU utilization monitoring (node0197)
│ ├── node0200_prefill_gpu_utilization.log # GPU utilization monitoring (node0200)
│ ├── node0201_decode_gpu_utilization.log # GPU utilization monitoring (node0201)
│ └── node0204_decode_gpu_utilization.log # GPU utilization monitoring (node0204)
├── 3063137/ # Another job ID directory
├── 3062689/ # Another job ID directory
└── ...
```

## Setup

For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes with high-performance
interconnects, such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin set up. In particular, the `job_script_template.j2` template in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container-based plugins, you may be able to
modify the template to use those instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../docs/dsr1-wideep-h100.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.

## Usage

1. **Submit a benchmark job**:
```bash
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account
```

**Required arguments**:
- `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path
- `--config-dir`: Config directory path
- `--container-image`: Container image URI (e.g., `registry/repository:tag`)
- `--account`: SLURM account

**Optional arguments**:
- `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)

**Note**: The script automatically calculates the total number of nodes needed based on the `--prefill-nodes` and `--decode-nodes` parameters (a minimal sketch of this flow appears after this list).

2. **Monitor job progress**:
```bash
squeue -u $USER
```

3. **Check logs in real-time**:
```bash
tail -f logs/{JOB_ID}/log.out
```

4. **Monitor GPU utilization**:
```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
```

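As referenced in step 1 above, here is a minimal sketch of how a submission script of this shape could derive the total node count and render the Jinja2 template before calling `sbatch`. The variable names, template fields, and output path are illustrative assumptions, not the actual `submit_job_script.py` implementation:

```python
# Illustrative sketch of node-count derivation and template rendering.
# Field names and the output path are assumptions, not the real script.
import subprocess
from jinja2 import Template

prefill_nodes = 2                           # --prefill-nodes
decode_nodes = 2                            # --decode-nodes
total_nodes = prefill_nodes + decode_nodes  # derived, not user-specified

with open("job_script_template.j2") as f:
    template = Template(f.read())

rendered = template.render(
    job_name="dynamo_setup",
    account="your-slurm-account",
    total_nodes=total_nodes,
    prefill_nodes=prefill_nodes,
    decode_nodes=decode_nodes,
    gpus_per_node=8,
    time_limit="01:00:00",
)

with open("job.sbatch", "w") as f:
    f.write(rendered)

subprocess.run(["sbatch", "job.sbatch"], check=True)
```
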
## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
Please refer to [Deploying Dynamo with SGLang on SLURM](../../../../../docs/components/backends/sglang/slurm_jobs/README.md) for more details.
35 changes: 14 additions & 21 deletions components/backends/trtllm/README.md
@@ -49,19 +49,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
| [**Disaggregated Serving**](../../../architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../../architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../architecture/kvbm_architecture.md) | 🚧 | Planned |

### Large Scale P/D and WideEP Features

| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | |
| **DP Rank Routing**| ✅ | |
| **Attention DP** | ✅ | |
| **GB200 Support** | ✅ | |

## Quick Start
@@ -70,7 +70,7 @@ Below we provide a guide that lets you run all of the common deployment patterns

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
Start using Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```

@@ -180,7 +180,7 @@ Below we provide a selected list of advanced examples. Please open up an issue i

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.

### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
@@ -191,15 +191,15 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo

### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)

`model` name and `host` based on your deployment:
```bash
{REPO_ROOT}/benchmarks/llm/perf.sh
```

## Disaggregation Strategy

@@ -236,11 +236,4 @@ The migrated request will continue responding to the original request, allowing

## Client

See the [quickstart guide](../../../examples/basics/quickstart/README.md#3-send-requests) to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
16 changes: 8 additions & 8 deletions components/backends/vllm/README.md
@@ -35,19 +35,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |
| [**Disaggregated Serving**](../../../architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../architecture/kvbm_architecture.md) | 🚧 | WIP |

### Large Scale P/D and WideEP Features

| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **Attention DP** | ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

## Quick Start
@@ -56,7 +56,7 @@ Below we provide a guide that lets you run all of the common deployment patterns

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
Start using Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```

1 change: 1 addition & 0 deletions deploy/helm/README.md
@@ -34,6 +34,7 @@ Here is how you would install a VLLM inference backend example.

```bash
helm upgrade --install dynamo-graph ./deploy/helm/chart -n dynamo-cloud -f ./components/backends/vllm/deploy/agg.yaml
```

### Installation using Grove

85 changes: 0 additions & 85 deletions docs/API/nixl_connect/README.md
@@ -64,97 +64,13 @@ sequenceDiagram
RemoteWorker -->> LocalWorker: Notify completion (unblock awaiter)
```

## Examples

### Generic Example

In the diagram below, Local creates a [`WritableOperation`](writable_operation.md) intended to receive data from Remote.
Local then sends metadata about the requested RDMA operation to Remote.
Remote then uses the metadata to create a [`WriteOperation`](write_operation.md) which will perform the GPU Direct RDMA memory transfer from Remote's GPU memory to Local's GPU memory.

```mermaid
---
title: Write Operation Between Two Workers
---
flowchart LR
c1[Remote] --"3: .begin_write()"--- WriteOperation
WriteOperation e1@=="4: GPU Direct RDMA"==> WritableOperation
WritableOperation --"1: .create_writable()"--- c2[Local]
c2 e2@--"2: RDMA Metadata via HTTP"--> c1
e1@{ animate: true; }
e2@{ animate: true; }
```
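
To make the numbered steps concrete, here is a heavily hedged async sketch of both sides. Only `create_writable()` and `begin_write()` appear in the diagram above; the import path, the `to_serialized()`/`wait_for_completion()` helpers, the argument order, and the HTTP transport are assumptions for illustration, not the documented API:

```python
# Hypothetical sketch of the write flow above; see hedges in the lead-in.
import torch
from dynamo.nixl_connect import Connector, Descriptor  # assumed import path

async def local_worker(connector: Connector, post_metadata) -> torch.Tensor:
    # Step 1: register a GPU buffer that Remote may write into.
    tensor = torch.empty(4096, device="cuda")
    writable = connector.create_writable(Descriptor(tensor))
    # Step 2: hand RDMA metadata to Remote out-of-band (HTTP in the diagram).
    await post_metadata(writable.to_serialized())  # assumed helper names
    # Step 4 (performed by Remote) unblocks this await once the write lands.
    await writable.wait_for_completion()           # assumed method name
    return tensor

async def remote_worker(connector: Connector, serialized_request,
                        result: torch.Tensor) -> None:
    # Step 3: begin the GPU Direct RDMA write into Local's registered memory.
    write_op = connector.begin_write(Descriptor(result), serialized_request)
    await write_op.wait_for_completion()           # assumed method name
```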

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):

1. The HTTP frontend accepts a text prompt and a URL to an image.

2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker.

3. Decode Worker then requests a Prefill Worker to provide key-value data for the LLM powering the Decode Worker.

4. Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker.

5. Encode Worker acquires the image, processes it, performs inference on the image using a specialized vision model, and finally provides the embeddings to Prefill Worker.

6. Prefill Worker receives the embeddings from Encode Worker and generates a key-value cache (KV$) update for Decode Worker's LLM and writes the update directly to the GPU memory reserved for the data.

7. Finally, Decode Worker performs the requested inference.

```mermaid
---
title: Multimodal Disaggregated Workflow
---
flowchart LR
p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor]
p0 i1@--"url"-->p1
p1 i2@--"prompt"-->dw[Decode Worker]
p1 i3@--"url"-->dw
dw i4@--"prompt"-->pw[Prefill Worker]
dw i5@--"url"-->pw
pw i6@--"url"-->ew[Encode Worker]
ew o0@=="image embeddings"==>pw
pw o1@=="kv_cache updates"==>dw
dw o2@--"inference results"-->p0

i0@{ animate: true; }
i1@{ animate: true; }
i2@{ animate: true; }
i3@{ animate: true; }
i4@{ animate: true; }
i5@{ animate: true; }
i6@{ animate: true; }
o0@{ animate: true; }
o1@{ animate: true; }
o2@{ animate: true; }
```

> [!Note]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes the NIXL base RDMA subsystem directly without using the Dynamo NIXL Connect library.

#### Code Examples

See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation's completion before making use of the transferred data.

See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example
for how the resulting embeddings are registered with the RDMA subsystem by creating a [`Descriptor`](descriptor.md),
how a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
and how the worker awaits completion of the data transfer before yielding a response.


## Python Classes

- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [ReadOperation](read_operation.md)
- [ReadableOperation](readable_operation.md)
- [SerializedRequest](serialized_request.md)
- [WritableOperation](writable_operation.md)
- [WriteOperation](write_operation.md)

@@ -164,5 +80,4 @@
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
2 changes: 1 addition & 1 deletion docs/architecture/dynamo_flow.md
@@ -67,7 +67,7 @@ Coordination and messaging support:

## Technical Implementation Details

### NIXL (NVIDIA Interchange Library):
### NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
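
As a rough illustration of this coordination pattern (not Dynamo's actual implementation), a decode worker could publish its transfer metadata to ETCD and a prefill worker could load it roughly like this; the key layout, payload shape, and choice of the `python-etcd3` client are all assumptions:

```python
# Illustrative ETCD coordination sketch; key names, payload shape, and the
# python-etcd3 client are assumptions, not Dynamo's actual implementation.
import json
import etcd3

client = etcd3.client(host="localhost", port=2379)

# Decode Worker: publish the metadata peers need to open a direct channel.
metadata = {"worker_id": "decode-0", "nixl_agent_metadata": "<opaque>"}
client.put("/nixl/workers/decode-0", json.dumps(metadata))

# PrefillWorker: load that metadata to establish the communication channel.
value, _ = client.get("/nixl/workers/decode-0")
peer = json.loads(value)
print(peer["worker_id"])
```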