Merged

40 commits
- `d5cc512` fix trt-llm readme per ryan's feedback on attention dp (athreesh, Aug 4, 2025)
- `be7797f` adding in docs reference for metrics (athreesh, Aug 4, 2025)
- `11146a8` fix all readmes + added in quickstart commands (athreesh, Aug 5, 2025)
- `8b1e3e6` update --use-kv-events to --kv-events flag (athreesh, Aug 5, 2025)
- `fad6124` updating planner doc (athreesh, Aug 5, 2025)
- `d3dc695` fixing merge conflict (athreesh, Aug 5, 2025)
- `34d95ee` examples update to fix merge conflict (athreesh, Aug 5, 2025)
- `05da6ed` add sglang doc (athreesh, Aug 5, 2025)
- `f49f037` adding api/nixl topics to hidden_toc.rst (kmkelle-nv, Aug 5, 2025)
- `19b282f` small change so biswa can start (athreesh, Aug 5, 2025)
- `404b8c9` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 5, 2025)
- `6220de1` lot of trt-llm readme link fixes (athreesh, Aug 5, 2025)
- `3f9f892` trt-llm links fixed (athreesh, Aug 5, 2025)
- `aabd7c9` fixing 1 example link (athreesh, Aug 5, 2025)
- `4a3f2b0` vllm link fixes (athreesh, Aug 5, 2025)
- `45a36a7` fix toctree (athreesh, Aug 6, 2025)
- `e5914cc` fix: sglang (biswapanda, Aug 6, 2025)
- `9d41014` docs: add helm install guide and sglang deploy README (athreesh, Aug 6, 2025)
- `17f3539` docs: fix examples links (nealvaidya, Aug 6, 2025)
- `dbc0192` fixes to merge conflict (athreesh, Aug 6, 2025)
- `c6eef3e` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 6, 2025)
- `6227d22` fixed links more (athreesh, Aug 6, 2025)
- `5809a1b` fixed vllm links, toctree left (athreesh, Aug 6, 2025)
- `50d6fa3` fixing the runtime link ref (athreesh, Aug 6, 2025)
- `5a7baa0` mv multinode-examples copy to multinode-examples (biswapanda, Aug 6, 2025)
- `8891bd6` docs: fix up toctress (nealvaidya, Aug 6, 2025)
- `e12ced6` fix quickstart instructions (athreesh, Aug 6, 2025)
- `2d72ec5` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 6, 2025)
- `3eb3974` sgl references (athreesh, Aug 6, 2025)
- `b50c7c3` sgl references fix (athreesh, Aug 6, 2025)
- `b6482ed` docs: fix index.rst (nealvaidya, Aug 6, 2025)
- `e1916d8` make hidden toctree orphan (nealvaidya, Aug 6, 2025)
- `fb199c9` adjusted index.rst (athreesh, Aug 6, 2025)
- `e4cf69f` Merge branch 'backport/anish-index-rst-into-0.4.0' of https://github.… (athreesh, Aug 6, 2025)
- `022aaf2` Resolve merge conflict: remove duplicate line in dynamo_deploy README (athreesh, Aug 6, 2025)
- `359a256` Merge branch 'release/0.4.0' into backport/anish-index-rst-into-0.4.0 (athreesh, Aug 6, 2025)
- `a0c6cdb` fix another link lol (athreesh, Aug 6, 2025)
- `ce174f7` doc rendering fixes (nealvaidya, Aug 6, 2025)
- `05db79c` fix whitespace issues (nealvaidya, Aug 6, 2025)
- `d7b2271` add missing triple tildes (biswapanda, Aug 6, 2025)
10 changes: 5 additions & 5 deletions components/backends/sglang/README.md
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

### Large Scale P/D and WideEP Features

| Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅ | Full support on H100s/GB200 |
| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP workers |
| **GB200 Support** | ✅ | |


## Quick Start
109 changes: 1 addition & 108 deletions components/backends/sglang/slurm_jobs/README.md
@@ -1,108 +1 @@
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks (an illustrative sketch of its sampling loop follows this list)

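A script like `monitor_gpu_utilization.sh` typically samples `nvidia-smi` on a fixed interval. Below is a minimal Python sketch of that pattern; the query fields, the 5-second interval, and the output format are assumptions for illustration, not the actual script's behavior:

```python
# Minimal sketch of a GPU-utilization sampling loop (illustrative only).
# The actual monitor_gpu_utilization.sh is a shell script; the query fields
# and interval here are assumptions.
import subprocess
import time

def sample_gpu_utilization(interval_s: float = 5.0) -> None:
    """Print one CSV utilization sample per GPU every interval_s seconds."""
    while True:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=timestamp,index,utilization.gpu,memory.used",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        print(result.stdout.strip(), flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    sample_gpu_utilization()
```
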
## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
│ ├── log.err # Main job errors
│ ├── node0197_prefill.out # Prefill node stdout (node0197)
│ ├── node0197_prefill.err # Prefill node stderr (node0197)
│ ├── node0200_prefill.out # Prefill node stdout (node0200)
│ ├── node0200_prefill.err # Prefill node stderr (node0200)
│ ├── node0201_decode.out # Decode node stdout (node0201)
│ ├── node0201_decode.err # Decode node stderr (node0201)
│ ├── node0204_decode.out # Decode node stdout (node0204)
│ ├── node0204_decode.err # Decode node stderr (node0204)
│ ├── node0197_prefill_gpu_utilization.log # GPU utilization monitoring (node0197)
│ ├── node0200_prefill_gpu_utilization.log # GPU utilization monitoring (node0200)
│ ├── node0201_decode_gpu_utilization.log # GPU utilization monitoring (node0201)
│ └── node0204_decode_gpu_utilization.log # GPU utilization monitoring (node0204)
├── 3063137/ # Another job ID directory
├── 3062689/ # Another job ID directory
└── ...
```

## Setup

For simplicity of the example, we will make some assumptions about your SLURM cluster:
1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes with high-performance
interconnects, such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin set up. In particular, the `job_script_template.j2` template in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container-based plugins, you may be able to
modify the template to use those instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../docs/dsr1-wideep-h100.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.

## Usage

1. **Submit a benchmark job**:
```bash
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account
```

**Required arguments**:
- `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path
- `--config-dir`: Config directory path
- `--container-image`: Container image URI (e.g., `registry/repository:tag`)
- `--account`: SLURM account

**Optional arguments**:
- `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)

**Note**: The script automatically calculates the total number of nodes needed based on the `--prefill-nodes` and `--decode-nodes` parameters (a minimal sketch of this flow appears after this list).

2. **Monitor job progress**:
```bash
squeue -u $USER
```

3. **Check logs in real-time**:
```bash
tail -f logs/{JOB_ID}/log.out
```

4. **Monitor GPU utilization**:
```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
```

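As referenced in step 1 above, here is a minimal sketch of how a submission script of this shape could derive the total node count and render the Jinja2 template before calling `sbatch`. The variable names, template fields, and output path are illustrative assumptions, not the actual `submit_job_script.py` implementation:

```python
# Illustrative sketch of node-count derivation and template rendering.
# Field names and the output path are assumptions, not the real script.
import subprocess
from jinja2 import Template

prefill_nodes = 2                           # --prefill-nodes
decode_nodes = 2                            # --decode-nodes
total_nodes = prefill_nodes + decode_nodes  # derived, not user-specified

with open("job_script_template.j2") as f:
    template = Template(f.read())

rendered = template.render(
    job_name="dynamo_setup",
    account="your-slurm-account",
    total_nodes=total_nodes,
    prefill_nodes=prefill_nodes,
    decode_nodes=decode_nodes,
    gpus_per_node=8,
    time_limit="01:00:00",
)

with open("job.sbatch", "w") as f:
    f.write(rendered)

subprocess.run(["sbatch", "job.sbatch"], check=True)
```
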
## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
Please refer to [Deploying Dynamo with SGLang on SLURM](../../../../../docs/components/backends/sglang/slurm_jobs/README.md) for more details.
35 changes: 14 additions & 21 deletions components/backends/trtllm/README.md
@@ -49,19 +49,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
| [**Disaggregated Serving**](../../../architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../../architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../architecture/kvbm_architecture.md) | 🚧 | Planned |

### Large Scale P/D and WideEP Features

| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | |
| **DP Rank Routing**| ✅ | |
| **Attention DP** | ✅ | |
| **GB200 Support** | ✅ | |

## Quick Start
@@ -70,7 +70,7 @@ Below we provide a guide that lets you run all of the common deployment patterns

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
Start using Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```

@@ -180,7 +180,7 @@ Below we provide a selected list of advanced examples. Please open up an issue i

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.

### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
@@ -191,15 +191,15 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo

### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)

`model` name and `host` based on your deployment:
```bash
{REPO_ROOT}/benchmarks/llm/perf.sh
```

## Disaggregation Strategy

@@ -236,11 +236,4 @@ The migrated request will continue responding to the original request, allowing

## Client

See the [quickstart guide](../../../examples/basics/quickstart/README.md#3-send-requests) to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
16 changes: 8 additions & 8 deletions components/backends/vllm/README.md
@@ -35,19 +35,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |
| [**Disaggregated Serving**](../../../architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../architecture/kvbm_architecture.md) | 🚧 | WIP |

### Large Scale P/D and WideEP Features

| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **Attention DP** | ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

## Quick Start
@@ -56,7 +56,7 @@ Below we provide a guide that lets you run all of the common deployment patterns

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
Start using Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```

1 change: 1 addition & 0 deletions deploy/helm/README.md
@@ -34,6 +34,7 @@ Here is how you would install a VLLM inference backend example.

```bash
helm upgrade --install dynamo-graph ./deploy/helm/chart -n dynamo-cloud -f ./components/backends/vllm/deploy/agg.yaml
```

### Installation using Grove

85 changes: 0 additions & 85 deletions docs/API/nixl_connect/README.md
@@ -64,97 +64,13 @@ sequenceDiagram
RemoteWorker -->> LocalWorker: Notify completion (unblock awaiter)
```

## Examples

### Generic Example

In the diagram below, Local creates a [`WritableOperation`](writable_operation.md) intended to receive data from Remote.
Local then sends metadata about the requested RDMA operation to Remote.
Remote then uses the metadata to create a [`WriteOperation`](write_operation.md) which will perform the GPU Direct RDMA memory transfer from Remote's GPU memory to Local's GPU memory.

```mermaid
---
title: Write Operation Between Two Workers
---
flowchart LR
c1[Remote] --"3: .begin_write()"--- WriteOperation
WriteOperation e1@=="4: GPU Direct RDMA"==> WritableOperation
WritableOperation --"1: .create_writable()"--- c2[Local]
c2 e2@--"2: RDMA Metadata via HTTP"--> c1
e1@{ animate: true; }
e2@{ animate: true; }
```
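
To make the numbered steps concrete, here is a heavily hedged async sketch of both sides. Only `create_writable()` and `begin_write()` appear in the diagram above; the import path, the `to_serialized()`/`wait_for_completion()` helpers, the argument order, and the HTTP transport are assumptions for illustration, not the documented API:

```python
# Hypothetical sketch of the write flow above; see hedges in the lead-in.
import torch
from dynamo.nixl_connect import Connector, Descriptor  # assumed import path

async def local_worker(connector: Connector, post_metadata) -> torch.Tensor:
    # Step 1: register a GPU buffer that Remote may write into.
    tensor = torch.empty(4096, device="cuda")
    writable = connector.create_writable(Descriptor(tensor))
    # Step 2: hand RDMA metadata to Remote out-of-band (HTTP in the diagram).
    await post_metadata(writable.to_serialized())  # assumed helper names
    # Step 4 (performed by Remote) unblocks this await once the write lands.
    await writable.wait_for_completion()           # assumed method name
    return tensor

async def remote_worker(connector: Connector, serialized_request,
                        result: torch.Tensor) -> None:
    # Step 3: begin the GPU Direct RDMA write into Local's registered memory.
    write_op = connector.begin_write(Descriptor(result), serialized_request)
    await write_op.wait_for_completion()           # assumed method name
```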

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):

1. The HTTP frontend accepts a text prompt and a URL to an image.

2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker.

3. Decode Worker then requests a Prefill Worker to provide key-value data for the LLM powering the Decode Worker.

4. Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker.

5. Encode Worker acquires the image, processes it, performs inference on the image using a specialized vision model, and finally provides the embeddings to Prefill Worker.

6. Prefill Worker receives the embeddings from Encode Worker and generates a key-value cache (KV$) update for Decode Worker's LLM and writes the update directly to the GPU memory reserved for the data.

7. Finally, Decode Worker performs the requested inference.

```mermaid
---
title: Multimodal Disaggregated Workflow
---
flowchart LR
p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor]
p0 i1@--"url"-->p1
p1 i2@--"prompt"-->dw[Decode Worker]
p1 i3@--"url"-->dw
dw i4@--"prompt"-->pw[Prefill Worker]
dw i5@--"url"-->pw
pw i6@--"url"-->ew[Encode Worker]
ew o0@=="image embeddings"==>pw
pw o1@=="kv_cache updates"==>dw
dw o2@--"inference results"-->p0

i0@{ animate: true; }
i1@{ animate: true; }
i2@{ animate: true; }
i3@{ animate: true; }
i4@{ animate: true; }
i5@{ animate: true; }
i6@{ animate: true; }
o0@{ animate: true; }
o1@{ animate: true; }
o2@{ animate: true; }
```

> [!Note]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes the NIXL base RDMA subsystem directly without using the Dynamo NIXL Connect library.

#### Code Examples

See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation's completion before making use of the transferred data.

See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example
for how the resulting embeddings are registered with the RDMA subsystem by creating a [`Descriptor`](descriptor.md),
how a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
and how the worker awaits completion of the data transfer before yielding a response.


## Python Classes

- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [ReadOperation](read_operation.md)
- [ReadableOperation](readable_operation.md)
- [SerializedRequest](serialized_request.md)
- [WritableOperation](writable_operation.md)
- [WriteOperation](write_operation.md)

@@ -164,5 +80,4 @@
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
2 changes: 1 addition & 1 deletion docs/architecture/dynamo_flow.md
@@ -67,7 +67,7 @@ Coordination and messaging support:

## Technical Implementation Details

### NIXL (NVIDIA Interchange Library):
### NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
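
As a rough illustration of this coordination pattern (not Dynamo's actual implementation), a decode worker could publish its transfer metadata to ETCD and a prefill worker could load it roughly like this; the key layout, payload shape, and choice of the `python-etcd3` client are all assumptions:

```python
# Illustrative ETCD coordination sketch; key names, payload shape, and the
# python-etcd3 client are assumptions, not Dynamo's actual implementation.
import json
import etcd3

client = etcd3.client(host="localhost", port=2379)

# Decode Worker: publish the metadata peers need to open a direct channel.
metadata = {"worker_id": "decode-0", "nixl_agent_metadata": "<opaque>"}
client.put("/nixl/workers/decode-0", json.dumps(metadata))

# PrefillWorker: load that metadata to establish the communication channel.
value, _ = client.get("/nixl/workers/decode-0")
peer = json.loads(value)
print(peer["worker_id"])
```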