From 9caa48ac1cd7c68c8ecf28d7823b52fbc4fa7119 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Mon, 7 Jul 2025 11:06:18 -0700 Subject: [PATCH 01/20] basic readme.md --- examples/tensorrt_llm_sd/README.md | 352 +++++++++++++++++++++++++++++ 1 file changed, 352 insertions(+) create mode 100644 examples/tensorrt_llm_sd/README.md diff --git a/examples/tensorrt_llm_sd/README.md b/examples/tensorrt_llm_sd/README.md new file mode 100644 index 0000000000..f844a56d94 --- /dev/null +++ b/examples/tensorrt_llm_sd/README.md @@ -0,0 +1,352 @@ + + +# LLM Deployment Examples using TensorRT-LLM + +This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM. + +## Use the Latest Release + +We recommend using the latest stable release of dynamo to avoid breaking changes: + +[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) + +You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: + +```bash +git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) +``` + +## Deployment Architectures + +See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. +Note that this TensorRT-LLM version does not support all the options yet. + +Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving. + +## Getting Started + +1. Choose a deployment architecture based on your requirements +2. Configure the components as needed +3. Deploy using the provided scripts + +### Prerequisites + +Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml) +```bash +docker compose -f deploy/metrics/docker-compose.yml up -d +``` + +### Build docker + +```bash +# TensorRT-LLM uses git-lfs, which needs to be installed in advance. +apt-get update && apt-get -y install git git-lfs + +# On an x86 machine: +./container/build.sh --framework tensorrtllm + +# On an ARM machine: +./container/build.sh --framework tensorrtllm --platform linux/arm64 + +# Build the container with the default experimental TensorRT-LLM commit +# WARNING: This is for experimental feature testing only. +# The container should not be used in a production environment. +./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit +``` + +### Run container + +``` +./container/run.sh --framework tensorrtllm -it +``` +## Run Deployment + +This figure shows an overview of the major components to deploy: + + + +``` + ++------+ +-----------+ +------------------+ +---------------+ +| HTTP |----->| processor |----->| Worker |------------>| Prefill | +| |<-----| |<-----| |<------------| Worker | ++------+ +-----------+ +------------------+ +---------------+ + | ^ | + query best | | return | publish kv events + worker | | worker_id v + | | +------------------+ + | +---------| kv-router | + +------------->| | + +------------------+ + +``` + +Note: The above architecture illustrates all the components. The final components +that get spawned depend upon the chosen graph. 
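+
+Once one of the deployments below is up, you can sanity-check it with a single
+OpenAI-compatible request to the Frontend (see the [client](#client) section for details).
+The route, port, and model name below are illustrative and assume the defaults used in
+`configs/agg.yaml`; adjust them to match your configuration:
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+    "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
+    "max_tokens": 64,
+    "stream": false
+  }'
+```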
+
+### Example architectures
+
+#### Aggregated serving
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
+```
+
+#### Aggregated serving with KV Routing
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.agg:Frontend -f ./configs/agg_router.yaml
+```
+
+#### Disaggregated serving
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
+```
+
+#### Disaggregated serving with KV Routing
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.disagg:Frontend -f ./configs/disagg_router.yaml
+```
+
+#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
+```
+
+Notes:
+- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add `--use-default-experimental-tensorrtllm-commit` to the arguments of the `build.sh` script.
+
+  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
+
+- There is noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
+- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
+
+#### Multi-Node Disaggregated Serving
+
+In the following example, we will demonstrate how to run a Disaggregated Serving
+deployment across multiple nodes. For simplicity, we will deploy a single Decode
+worker on one node and a single Prefill worker on another node.
+However, the instance counts, TP sizes, other configs, and responsibilities of each node
+can be customized and deployed in similar ways.
+
+For example, to deploy Deepseek R1, you could replace the referenced example
+configs (`configs/agg.yaml`, `configs/disagg.yaml`) with the corresponding Deepseek R1
+example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`).
+You can find the example Deepseek R1 configs for GB200
+[here](configs/deepseek_r1), but the config settings can be customized for testing
+other hardware configurations or parallelism strategies.
+
+This "multi-node" example demonstrates how to connect dynamo workers across
+different nodes, but for simplicity, each worker individually fits on a single node.
+For details on how to launch a worker that spans multiple nodes due to sheer model
+size, or for features like large-scale expert parallelism, see the
+[multinode worker example](configs/deepseek_r1/multinode).
+
+##### Head Node
+
+Start nats/etcd:
+```bash
+# NATS data persisted to /tmp/nats/jetstream by default
+nats-server -js &
+
+# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
+
+# NOTE: Clearing out the etcd and nats jetstream data directories across runs
+# helps guarantee clean and reproducible results.
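+
+# (Optional) Sanity check that etcd and NATS are reachable before launching any
+# workers. These commands are illustrative; adjust hosts/ports if you changed
+# the defaults above.
+curl -s http://localhost:2379/health
+bash -c 'exec 3<>/dev/tcp/localhost/4222' && echo "NATS port 4222 is open"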
+``` + +Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: + +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml & +``` + +Notes: +- The aggregated graph (`graphs.agg`) is chosen here because it also describes + our desired deployment settings for the head node: launching the utility components + (Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with + `remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker` + independently on a separate node in the next step of this demonstration. + You are free to customize the graph and configuration of components launched on + each node. +- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a + single source of truth to be used for deployments on all of our nodes, describing + the configurations for all of our components, including both decode and prefill + workers, but can be customized based on your deployment needs. + +##### Worker Node(s) + +Set environment variables pointing at the etcd/nats endpoints on the head node +so the Dynamo Distributed Runtime can orchestrate communication and +discoverability between the head node and worker nodes: +```bash +# if not head node +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +Deploy a Prefill worker: +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker & +``` + +Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node! + +##### Additional Notes for Multi-Node Deployments + +Notes: +- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`, + and change the config to one that includes the router, such as `configs/disagg_router.yaml` +- This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes. + Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step + remains mostly the same. The primary difference between aggregation and disaggregation for this step is + whether or not the `TensorRTLLMWorker` is configured to do `remote-prefill` or not in the config file + (ex: `configs/disagg.yaml` vs `configs/agg.yaml`). +- To apply the same concept for launching additional decode workers on worker nodes, you can + directly start them, similar to the prefill worker step above: + ```bash + # Example: deploy decode worker only + cd /workspace/examples/tensorrt_llm + dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker & + ``` +- If you see an error about MPI Spawn failing during TRTLLM Worker initialziation on a Slurm-based cluster, + try unsetting the following environment variables before launching the TRTLLM worker. If you intend to + run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may + want to save these values into temporary variables and then restore them afterwards. 
+ ```bash + # Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes` + unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST + ``` + +#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1 + +Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations + +##### Head Node + +Start nats/etcd +```bash +nats-server -js & +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & +``` + +Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: + +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml & +``` + +##### Worker Node(s) + +Set environment variables pointing at the etcd/nats endpoints on the head node. +```bash +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +Deploy a Prefill worker: +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker & +``` + +Notes: +- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script. + + Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit` +- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. +- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. + + +### Client + +See [client](../llm/README.md#client) section to learn how to send request to the deployment. + +NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component. + +### Close deployment + +See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment. + +### Benchmarking + +To benchmark your deployment with GenAI-Perf, see this utility script, configuring the +`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh) + + +### KV Cache Transfer for Disaggregated Serving + +In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer: + +#### Default Method: UCX +By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers. + +#### Experimental Method: NIXL +TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments. + +**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet. 
+
+#### Using NIXL for KV Cache Transfer
+
+**Note:** The NIXL backend for TensorRT-LLM is currently only supported on the AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer.
+
+To enable NIXL for KV cache transfer in disaggregated serving:
+
+1. **Build the container with NIXL support:**
+   The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.
+
+   **Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):**
+   ```bash
+   rm -rf /tmp/trtllm_wheel
+   ```
+
+   **Build the container with NIXL support:**
+   ```bash
+   ./container/build.sh --framework tensorrtllm \
+     --use-default-experimental-tensorrtllm-commit \
+     --trtllm-use-nixl-kvcache-experimental
+   ```
+
+   **Note:** Both the `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.
+
+2. **Run the containerized environment:**
+   See the [run container](#run-container) section to learn how to start the container image built in the previous step.
+
+3. **Start the disaggregated service:**
+   See [disaggregated serving](#disaggregated-serving) to learn how to start the deployment.
+
+4. **Send the request:**
+   See the [client](#client) section to learn how to send a request to the deployment.
+
+**Important:** Ensure that etcd and NATS services are running before starting the service.
+
+The container automatically sets the appropriate environment variable (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can also be used with UCX for KV cache transfer by setting:
+```bash +unset TRTLLM_USE_NIXL_KVCACHE +export TRTLLM_USE_UCX_KVCACHE=1 +``` + From bf4c6bbbbe750d874ec60b2b62690bf4d1c71a89 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Mon, 7 Jul 2025 15:02:51 -0700 Subject: [PATCH 02/20] Copied over previous tutorial --- examples/tensorrt_llm_sd/__init__.py | 14 + examples/tensorrt_llm_sd/common/__init__.py | 0 .../tensorrt_llm_sd/common/base_engine.py | 389 ++++++++++++++++++ examples/tensorrt_llm_sd/common/parser.py | 62 +++ examples/tensorrt_llm_sd/common/protocol.py | 104 +++++ .../tensorrt_llm_sd/components/frontend.py | 119 ++++++ .../components/prefill_worker.py | 75 ++++ examples/tensorrt_llm_sd/components/worker.py | 115 ++++++ examples/tensorrt_llm_sd/configs/agg.yaml | 34 ++ .../tensorrt_llm_sd/configs/agg_router.yaml | 34 ++ .../configs/deepseek_r1/agg.yaml | 35 ++ .../configs/deepseek_r1/disagg.yaml | 49 +++ .../engine_configs/agg_config.yaml | 54 +++ .../engine_configs/decode_config.yaml | 55 +++ .../engine_configs/prefill_config.yaml | 37 ++ .../mtp/engine_configs/agg_config.yaml | 50 +++ .../mtp/engine_configs/decode_config.yaml | 53 +++ .../mtp/engine_configs/prefill_config.yaml | 37 ++ .../configs/deepseek_r1/mtp/mtp_agg.yaml | 36 ++ .../configs/deepseek_r1/mtp/mtp_disagg.yaml | 52 +++ .../configs/deepseek_r1/multinode/README.md | 275 +++++++++++++ .../multinode/engine_configs/dep16_agg.yaml | 27 ++ .../multinode/engine_configs/eplb.yaml | 7 + .../multinode/engine_configs/wide_ep_agg.yaml | 35 ++ .../engine_configs/wide_ep_decode.yaml | 59 +++ .../engine_configs/wide_ep_prefill.yaml | 41 ++ .../deepseek_r1/multinode/srun_aggregated.sh | 75 ++++ .../multinode/srun_disaggregated.sh | 94 +++++ .../multinode/start_frontend_services.sh | 16 + .../multinode/start_trtllm_worker.sh | 46 +++ examples/tensorrt_llm_sd/configs/disagg.yaml | 48 +++ .../configs/disagg_router.yaml | 47 +++ .../configs/engine_configs/agg_config.yaml | 31 ++ .../configs/engine_configs/decode_config.yaml | 27 ++ .../engine_configs/prefill_config.yaml | 28 ++ examples/tensorrt_llm_sd/graphs/agg.py | 19 + examples/tensorrt_llm_sd/graphs/disagg.py | 20 + 37 files changed, 2299 insertions(+) create mode 100644 examples/tensorrt_llm_sd/__init__.py create mode 100644 examples/tensorrt_llm_sd/common/__init__.py create mode 100644 examples/tensorrt_llm_sd/common/base_engine.py create mode 100644 examples/tensorrt_llm_sd/common/parser.py create mode 100644 examples/tensorrt_llm_sd/common/protocol.py create mode 100644 examples/tensorrt_llm_sd/components/frontend.py create mode 100644 examples/tensorrt_llm_sd/components/prefill_worker.py create mode 100644 examples/tensorrt_llm_sd/components/worker.py create mode 100644 examples/tensorrt_llm_sd/configs/agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/agg_router.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml create 
mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh create mode 100644 examples/tensorrt_llm_sd/configs/disagg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/disagg_router.yaml create mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml create mode 100644 examples/tensorrt_llm_sd/graphs/agg.py create mode 100644 examples/tensorrt_llm_sd/graphs/disagg.py diff --git a/examples/tensorrt_llm_sd/__init__.py b/examples/tensorrt_llm_sd/__init__.py new file mode 100644 index 0000000000..3159bfe656 --- /dev/null +++ b/examples/tensorrt_llm_sd/__init__.py @@ -0,0 +1,14 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/examples/tensorrt_llm_sd/common/__init__.py b/examples/tensorrt_llm_sd/common/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/tensorrt_llm_sd/common/base_engine.py b/examples/tensorrt_llm_sd/common/base_engine.py new file mode 100644 index 0000000000..3df95b490c --- /dev/null +++ b/examples/tensorrt_llm_sd/common/base_engine.py @@ -0,0 +1,389 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +from dataclasses import dataclass +from typing import Any, Optional + +from common.protocol import DisaggregatedTypeConverter, TRTLLMWorkerRequest +from tensorrt_llm import SamplingParams +from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_options +from tensorrt_llm.llmapi.tokenizer import tokenizer_factory +from tensorrt_llm.serve.openai_protocol import ( + DisaggregatedParams as OAIDisaggregatedParams, +) + +from dynamo.llm import get_tensorrtllm_engine, get_tensorrtllm_publisher +from dynamo.runtime import DistributedRuntime + +logger = logging.getLogger(__name__) + +logger.setLevel(logging.DEBUG) + +# Default buffer size for kv cache events. +DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024 + + +def parse_endpoint(endpoint: str) -> tuple[str, str, str]: + endpoint_str = endpoint.replace("dyn://", "", 1) + endpoint_parts = endpoint_str.split(".") + if len(endpoint_parts) != 3: + raise ValueError( + f"Invalid endpoint format: '{endpoint}'. " + "Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'." + ) + + return (endpoint_parts[0], endpoint_parts[1], endpoint_parts[2]) + + +@dataclass +class BaseEngineConfig: + """Base engine configuration""" + + namespace: str + component: str + endpoint: str + model_path: str + served_model_name: Optional[str] = None + kv_block_size: int = 32 + extra_engine_args: str = "" + publish_events_and_metrics: bool = False + disaggregation_mode: str = "prefill_and_decode" + remote_prefill_endpoint: Optional[str] = None + lease_id: int = 0 + + def __str__(self) -> str: + return ( + f"Config(namespace={self.namespace}, " + f"component={self.component}, " + f"endpoint={self.endpoint}, " + f"model_path={self.model_path}, " + f"served_model_name={self.served_model_name}, " + f"kv_block_size={self.kv_block_size}, " + f"extra_engine_args={self.extra_engine_args}, " + f"publish_events_and_metrics={self.publish_events_and_metrics}, " + f"disaggregation_mode={self.disaggregation_mode}, " + f"remote_prefill_endpoint={self.remote_prefill_endpoint}, " + f"lease_id={self.lease_id})" + ) + + +class BaseTensorrtLLMEngine: + def __init__( + self, + config: BaseEngineConfig, + ): + self._config = config + self._prefill_client = None + self._llm_engine = None + self._llm_engine_context = None + self._llm_publisher = None + self._llm_publisher_context = None + self._runtime = None + self._first_generation = True + # Initialize default sampling params + self.default_sampling_params = SamplingParams() + + async def initialize(self, runtime: DistributedRuntime): + """Initialize the engine and prefill client if needed""" + self._runtime = runtime + + # Convert model path to Path object if it's a local path, otherwise keep as string + model_path = str(self._config.model_path) + + # Initialize the LLM engine + engine_args: dict[str, Any] = { + "model": model_path, + "tensor_parallel_size": 1, + "backend": "pytorch", + "skip_tokenizer_init": True, + } + + if self._config.extra_engine_args: + # TODO: Support extra engine args from json file as well. 
+ engine_args = update_llm_args_with_extra_options( + engine_args, self._config.extra_engine_args + ) + # Update the model path in the config to the model path used by the engine. + self._config.model_path = str(engine_args["model"]) + if not self._config.model_path: + raise ValueError( + "Model specification is required. Present neither in the config nor in the extra engine args." + ) + + # Populate default sampling params from the model + tokenizer = tokenizer_factory(self._config.model_path) + self.default_sampling_params = SamplingParams() + self.default_sampling_params._setup(tokenizer) + self.default_sampling_params.stop = None + + if self._config.publish_events_and_metrics: + # 'event_buffer_max_size' is required to enable TRTLLM to publish kv cache events. + kv_cache_config: dict[str, Any] | Any = None + if "kv_cache_config" not in engine_args: + kv_cache_config = {} + kv_cache_config[ + "event_buffer_max_size" + ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE + else: + kv_cache_config = engine_args["kv_cache_config"] + if ( + hasattr(kv_cache_config, "event_buffer_max_size") + and not kv_cache_config.event_buffer_max_size + ): + kv_cache_config.event_buffer_max_size = ( + DEFAULT_KV_EVENT_BUFFER_MAX_SIZE + ) + elif ( + isinstance(kv_cache_config, dict) + and "event_buffer_max_size" not in kv_cache_config + ): + kv_cache_config[ + "event_buffer_max_size" + ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE + engine_args["kv_cache_config"] = kv_cache_config + + # Enable iter perf stats by default if we are publishing events and metrics. + if not engine_args.get("enable_iter_perf_stats"): + engine_args["enable_iter_perf_stats"] = True + + # Only pytorch backend is supported for now to publish events and metrics. + if engine_args.get("backend") != "pytorch": + logging.error( + "Only pytorch backend is supported for now to publish events and metrics." + ) + raise RuntimeError( + "Only pytorch backend is supported for now to publish events and metrics. Hence, KV router is not supported." 
+ ) + + logging.info(f"TRTLLM engine args: {engine_args}") + + # Get the engine using the asynccontextmanager + self._llm_engine_context = get_tensorrtllm_engine(engine_args) + if self._llm_engine_context is not None: + self._llm_engine = await self._llm_engine_context.__aenter__() + else: + raise RuntimeError("Failed to create LLM engine context") + + if ( + self._config.publish_events_and_metrics + and self._config.disaggregation_mode != "prefill" + ): + kv_listener = runtime.namespace(self._config.namespace).component( + self._config.component + ) + self._llm_publisher_context = get_tensorrtllm_publisher( + kv_listener, + self._llm_engine, + kv_listener, + self._config.lease_id, + self._config.kv_block_size, + ) + if self._llm_publisher_context is not None: + self._llm_publisher = await self._llm_publisher_context.__aenter__() + else: + raise RuntimeError("Failed to create LLM publisher context") + + # Initialize prefill client if in decode mode + if self._config.disaggregation_mode == "decode": + if self._config.remote_prefill_endpoint is None: + raise ValueError("remote_prefill_endpoint is required for decode mode") + logging.info( + f"Initializing remote prefill client for endpoint: {self._config.remote_prefill_endpoint}" + ) + ( + parsed_namespace, + parsed_component_name, + parsed_endpoint_name, + ) = parse_endpoint(self._config.remote_prefill_endpoint) + if self._runtime is not None: + self._prefill_client = ( + await self._runtime.namespace(parsed_namespace) + .component(parsed_component_name) + .endpoint(parsed_endpoint_name) + .client() + ) + else: + raise RuntimeError("Runtime not initialized") + + async def cleanup(self): + """Cleanup resources""" + if self._llm_publisher_context: + try: + await self._llm_publisher_context.__aexit__(None, None, None) + except Exception as e: + logging.error(f"Error during publisher cleanup: {e}") + finally: + self._llm_publisher = None + self._llm_publisher_context = None + + if self._llm_engine_context: + try: + await self._llm_engine_context.__aexit__(None, None, None) + except Exception as e: + logging.error(f"Error during engine cleanup: {e}") + finally: + self._llm_engine = None + self._llm_engine_context = None + + self._prefill_client = None + + async def remote_prefill(self, request: TRTLLMWorkerRequest): + """ + Send a prefill request to the remote prefill worker. + + Args: + request: The original request to be sent for prefill + + Returns: + The response from the remote prefill worker + + Raises: + ValueError: If prefill client is not initialized or multiple responses received + """ + prefill_request = request.model_copy(deep=True) + # TRTLLM requires max_tokens to be set for prefill requests. + prefill_request.stop_conditions.max_tokens = 1 + prefill_request.disaggregated_params = OAIDisaggregatedParams( + request_type="context_only" + ) + + if self._prefill_client is None: + raise ValueError("Prefill client not initialized") + try: + # TODO: Use smart KV router to determine which prefill worker to use. This would also require supporting publishing events for prefill workers. + remote_prefill_responses = [ + remote_prefill_response + async for remote_prefill_response in await self._prefill_client.round_robin( + prefill_request.model_dump_json() + ) + ] + except Exception as e: + raise ValueError(f"Error in remote prefill: {e}") + + if len(remote_prefill_responses) > 1: + raise ValueError( + "Prefill worker returned more than one response. This is currently not supported in remote prefill mode." 
+ ) + + if len(remote_prefill_responses) == 0: + raise ValueError("No response received from remote prefill worker") + + remote_prefill_response = remote_prefill_responses[0] + return remote_prefill_response + + async def generate(self, request: TRTLLMWorkerRequest): + if self._llm_engine is None: + raise RuntimeError("Engine not initialized") + + if self._llm_publisher: + publishers_error = self._llm_publisher.check_error_queue() + if publishers_error: + raise publishers_error + + inputs = request.token_ids + + # Decode the disaggregated params from the request + disaggregated_params = DisaggregatedTypeConverter.to_llm_disaggregated_params( + request.disaggregated_params + ) + num_output_tokens_so_far = 0 + + if self._config.disaggregation_mode == "decode": + # Run prefill/context phase remotely if disaggregation mode is decode. + try: + prefill_result = await self.remote_prefill(request) + except Exception as e: + raise ValueError(f"Error in remote prefill: {e}") + + remote_prefill_response = prefill_result.data() + if ( + remote_prefill_response["finish_reason"] == "stop" + or remote_prefill_response["finish_reason"] == "error" + ): + yield remote_prefill_response + return + num_output_tokens_so_far = len(remote_prefill_response["token_ids"]) + + # Decode the disaggregated params from the remote prefill response + # Decode the disaggregated params from the remote prefill response + disaggregated_params = ( + DisaggregatedTypeConverter.to_llm_disaggregated_params( + OAIDisaggregatedParams( + **remote_prefill_response["disaggregated_params"] + ) + ) + ) + + # Send the first token response to the client + first_token_response = remote_prefill_response + first_token_response.pop("disaggregated_params") + yield first_token_response + + # Set the disaggregated params to generation_only for the rest of the generation + disaggregated_params.request_type = "generation_only" + + sampling_params = self.default_sampling_params + for key, value in request.sampling_options.model_dump().items(): + if not value: + continue + if hasattr(sampling_params, key): + setattr(sampling_params, key, value) + + max_tokens = request.stop_conditions.max_tokens + if max_tokens: + sampling_params.max_tokens = max_tokens + + ignore_eos = request.stop_conditions.ignore_eos + if ignore_eos: + sampling_params.ignore_eos = ignore_eos + + # TODO: Disable streaming for context only requests when adding disagg support + async for res in self._llm_engine.llm.generate_async( + inputs=inputs, + sampling_params=sampling_params, + disaggregated_params=disaggregated_params, + streaming=(self._config.disaggregation_mode != "prefill"), + ): + # TRTLLM engine needs to start generating tokens first before stats + # can be retrieved. + if self._first_generation and self._llm_publisher: + self._llm_publisher.start() + self._first_generation = False + + if res.finished and self._config.disaggregation_mode != "prefill": + yield {"finish_reason": "stop", "token_ids": []} + break + + if not res.outputs: + yield {"finish_reason": "error", "token_ids": []} + break + + output = res.outputs[0] + next_total_toks = len(output.token_ids) + out = {"token_ids": output.token_ids[num_output_tokens_so_far:]} + if output.finish_reason: + out["finish_reason"] = output.finish_reason + if output.stop_reason: + out["stop_reason"] = output.stop_reason + if self._config.disaggregation_mode == "prefill": + # Return the disaggregated params only when operating in prefill mode. 
+ out[ + "disaggregated_params" + ] = DisaggregatedTypeConverter.to_oai_disaggregated_params( + output.disaggregated_params + ).model_dump() + + yield out + num_output_tokens_so_far = next_total_toks diff --git a/examples/tensorrt_llm_sd/common/parser.py b/examples/tensorrt_llm_sd/common/parser.py new file mode 100644 index 0000000000..67bb230796 --- /dev/null +++ b/examples/tensorrt_llm_sd/common/parser.py @@ -0,0 +1,62 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_tensorrt_llm_args( + config_args, +) -> argparse.Namespace: + parser = argparse.ArgumentParser(description="A TensorRT-LLM Worker parser") + parser.add_argument( + "--extra-engine-args", + type=str, + default="", + help="Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.", + ) + parser.add_argument( + "--model-path", + type=str, + default=None, + help="Path to disk model or HuggingFace model identifier to load.", + ) + parser.add_argument( + "--served_model_name", + type=str, + help="Name to serve the model under.", + ) + parser.add_argument( + "--router", + type=str, + choices=["random", "round-robin", "kv"], + default="random", + help="Router type to use for scheduling requests to workers", + ) + + parser.add_argument( + "--kv-block-size", + type=int, + default=32, + help="Number of tokens per KV block in TRTLLM worker. Default is 32 for pytorch backend.", + ) + + parser.add_argument( + "--enable-disagg", + action="store_true", + help="Enable remote prefill for the worker", + ) + + args = parser.parse_args(config_args) + return args diff --git a/examples/tensorrt_llm_sd/common/protocol.py b/examples/tensorrt_llm_sd/common/protocol.py new file mode 100644 index 0000000000..f05cdb9f8f --- /dev/null +++ b/examples/tensorrt_llm_sd/common/protocol.py @@ -0,0 +1,104 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import base64 +from typing import List, Optional + +from pydantic import BaseModel, Field +from tensorrt_llm.llmapi import DisaggregatedParams as LlmDisaggregatedParams +from tensorrt_llm.serve.openai_protocol import DisaggregatedParams + + +class Tokens(BaseModel): + tokens: list[int] + + +TokenIdType = int + + +class DisaggregatedTypeConverter: + @staticmethod + def to_llm_disaggregated_params( + disaggregated_params: DisaggregatedParams, + ) -> LlmDisaggregatedParams: + if disaggregated_params is None: + return None + else: + opaque_state = ( + base64.b64decode(disaggregated_params.encoded_opaque_state) + if disaggregated_params.encoded_opaque_state is not None + else None + ) + + return LlmDisaggregatedParams( + request_type=disaggregated_params.request_type, + first_gen_tokens=disaggregated_params.first_gen_tokens, + ctx_request_id=disaggregated_params.ctx_request_id, + opaque_state=opaque_state, + ) + + @staticmethod + def to_oai_disaggregated_params( + tllm_disagg_params: LlmDisaggregatedParams, + ) -> DisaggregatedParams: + if tllm_disagg_params is None: + return None + else: + encoded_opaque_state = ( + base64.b64encode(tllm_disagg_params.opaque_state).decode("utf-8") + if tllm_disagg_params.opaque_state is not None + else None + ) + return DisaggregatedParams( + request_type=tllm_disagg_params.request_type, + first_gen_tokens=tllm_disagg_params.first_gen_tokens, + ctx_request_id=tllm_disagg_params.ctx_request_id, + encoded_opaque_state=encoded_opaque_state, + ) + + +# TODO: move these to common for all LLMs once we adopt dynamo-run +# derived from lib/llm/src/protocols/common/preprocessor.rs +class StopConditions(BaseModel): + max_tokens: Optional[int] = None + stop: Optional[List[str]] = None + stop_token_ids_hidden: Optional[List[TokenIdType]] = None + min_tokens: Optional[int] = None + ignore_eos: Optional[bool] = None + + +class SamplingOptions(BaseModel): + n: Optional[int] = None + best_of: Optional[int] = None + presence_penalty: Optional[float] = None + frequency_penalty: Optional[float] = None + repetition_penalty: Optional[float] = None + temperature: Optional[float] = None + top_p: Optional[float] = None + top_k: Optional[int] = None + min_p: Optional[float] = None + use_beam_search: Optional[bool] = None + length_penalty: Optional[float] = None + seed: Optional[int] = None + + +class TRTLLMWorkerRequest(BaseModel): + token_ids: List[TokenIdType] + stop_conditions: StopConditions + sampling_options: SamplingOptions + eos_token_ids: List[TokenIdType] = Field(default_factory=list) + mdc_sum: Optional[str] = None + annotations: List[str] = Field(default_factory=list) + estimated_prefix_hit_num_blocks: Optional[int] = None + disaggregated_params: Optional[DisaggregatedParams] = Field(default=None) diff --git a/examples/tensorrt_llm_sd/components/frontend.py b/examples/tensorrt_llm_sd/components/frontend.py new file mode 100644 index 0000000000..98be2dfa33 --- /dev/null +++ b/examples/tensorrt_llm_sd/components/frontend.py @@ -0,0 +1,119 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import subprocess +from pathlib import Path + +from components.worker import TensorRTLLMWorker +from fastapi import FastAPI +from pydantic import BaseModel + +from dynamo import sdk +from dynamo.sdk import depends, service +from dynamo.sdk.lib.config import ServiceConfig +from dynamo.sdk.lib.image import DYNAMO_IMAGE + +logger = logging.getLogger(__name__) + + +def get_dynamo_run_binary(): + """Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command.""" + sdk_path = Path(sdk.__file__) + binary_path = sdk_path.parent / "cli/bin/dynamo-run" + if not binary_path.exists(): + return "dynamo-run" + else: + return str(binary_path) + + +class FrontendConfig(BaseModel): + """Configuration for the Frontend service including model and HTTP server settings.""" + + served_model_name: str + endpoint: str + port: int = 8000 + router: str = "round-robin" + block_size: int = 32 + + +# todo this should be called ApiServer +@service( + dynamo={ + "namespace": "dynamo", + }, + workers=1, + image=DYNAMO_IMAGE, + app=FastAPI(title="TensorRT-LLM Example"), +) +class Frontend: + worker = depends(TensorRTLLMWorker) + + def __init__(self): + """Initialize Frontend service with HTTP server and model configuration.""" + self.frontend_config = FrontendConfig( + **ServiceConfig.get_parsed_config("Frontend") + ) + self.process = None + + logger.warning(f"Frontend config: {self.frontend_config}") + + self.start_ingress_and_processor() + + def start_ingress_and_processor(self): + """Starting dynamo-run based ingress and processor""" + logger.info( + f"Starting HTTP server and processor on port {self.frontend_config.port}" + ) + dynamo_run_binary = get_dynamo_run_binary() + + cmd = [ + dynamo_run_binary, + "in=http", + "out=dyn", + "--http-port", + str(self.frontend_config.port), + "--router-mode", + self.frontend_config.router, + ] + + logger.info(f"Frontend cmd: {cmd}") + + self.process = subprocess.Popen( + cmd, + stdout=None, + stderr=None, + ) + + def close(self): + """Clean up resources by terminating the subprocess.""" + if self.process is not None: + try: + logger.info("Terminating subprocess...") + self.process.terminate() + # Wait for process to terminate with a timeout + self.process.wait(timeout=5) + except subprocess.TimeoutExpired: + logger.warning("Subprocess did not terminate gracefully, forcing kill") + self.process.kill() + self.process.wait() + except Exception as e: + logger.error(f"Error while terminating subprocess: {e}") + finally: + self.process = None + + def __del__(self): + """Destructor to ensure subprocess is cleaned up.""" + self.close() diff --git a/examples/tensorrt_llm_sd/components/prefill_worker.py b/examples/tensorrt_llm_sd/components/prefill_worker.py new file mode 100644 index 0000000000..7e43d1fca7 --- /dev/null +++ b/examples/tensorrt_llm_sd/components/prefill_worker.py @@ -0,0 +1,75 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging + +from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine +from common.parser import parse_tensorrt_llm_args +from common.protocol import TRTLLMWorkerRequest + +from dynamo.sdk import async_on_start, dynamo_context, endpoint, on_shutdown, service +from dynamo.sdk.lib.config import ServiceConfig + +logger = logging.getLogger(__name__) + + +@service( + dynamo={ + "namespace": "dynamo", + }, + resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, + workers=1, +) +class TensorRTLLMPrefillWorker(BaseTensorrtLLMEngine): + def __init__(self): + logger.info("Initializing TensorRT-LLM Prefill Worker") + class_name = self.__class__.__name__ + config = ServiceConfig.get_instance() + config_args = config.as_args(class_name, prefix="") + args = parse_tensorrt_llm_args(config_args) + lease_id = dynamo_context["endpoints"][0].lease_id() + namespace, _ = TensorRTLLMPrefillWorker.dynamo_address() # type: ignore + + engine_config = BaseEngineConfig( + namespace=namespace, + component=class_name, + endpoint="generate", + model_path=args.model_path, + served_model_name=args.served_model_name, + kv_block_size=args.kv_block_size, + extra_engine_args=args.extra_engine_args, + publish_events_and_metrics=False, + disaggregation_mode="prefill", + remote_prefill_endpoint=None, + lease_id=lease_id, + ) + + super().__init__(config=engine_config) + + @async_on_start + async def async_init(self): + runtime = dynamo_context["runtime"] + await self.initialize(runtime) + logger.info("TensorRT-LLM Prefill Worker initialized") + + @on_shutdown + async def async_cleanup(self): + logger.info("Cleaning up TensorRT-LLM Prefill Worker") + await self.cleanup() + logger.info("TensorRT-LLM Prefill Worker cleanup completed") + + @endpoint() + async def generate(self, request: TRTLLMWorkerRequest): + async for response in super().generate(request): + yield response diff --git a/examples/tensorrt_llm_sd/components/worker.py b/examples/tensorrt_llm_sd/components/worker.py new file mode 100644 index 0000000000..9074bfbe8d --- /dev/null +++ b/examples/tensorrt_llm_sd/components/worker.py @@ -0,0 +1,115 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging + +from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine +from common.parser import parse_tensorrt_llm_args +from common.protocol import TRTLLMWorkerRequest +from components.prefill_worker import TensorRTLLMPrefillWorker + +from dynamo.llm import ModelType, register_llm +from dynamo.sdk import ( + async_on_start, + depends, + dynamo_context, + endpoint, + on_shutdown, + service, +) +from dynamo.sdk.lib.config import ServiceConfig + +logger = logging.getLogger(__name__) + + +@service( + dynamo={ + "namespace": "dynamo", + }, + resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, + workers=1, +) +class TensorRTLLMWorker(BaseTensorrtLLMEngine): + prefill_worker = depends(TensorRTLLMPrefillWorker) + + def __init__(self): + logger.info("Initializing TensorRT-LLM Worker") + class_name = self.__class__.__name__ + config = ServiceConfig.get_instance() + config_args = config.as_args(class_name, prefix="") + args = parse_tensorrt_llm_args(config_args) + lease_id = dynamo_context["endpoints"][0].lease_id() + namespace, _ = TensorRTLLMWorker.dynamo_address() # type: ignore + endpoint_name = "generate" + publish_events_and_metrics = args.router == "kv" + prefill_class_name = "TensorRTLLMPrefillWorker" + + if args.enable_disagg: + disaggregation_mode = "decode" + else: + disaggregation_mode = "prefill_and_decode" + + engine_config = BaseEngineConfig( + namespace=namespace, + component=class_name, + endpoint=endpoint_name, + model_path=args.model_path, + served_model_name=args.served_model_name, + kv_block_size=args.kv_block_size, + extra_engine_args=args.extra_engine_args, + publish_events_and_metrics=publish_events_and_metrics, + disaggregation_mode=disaggregation_mode, + remote_prefill_endpoint=f"dyn://{namespace}.{prefill_class_name}.generate", + lease_id=lease_id, + ) + + super().__init__(config=engine_config) + + @async_on_start + async def async_init(self): + runtime = dynamo_context["runtime"] + await self.initialize(runtime) + + logger.info("Registering LLM for discovery") + endpoint = ( + runtime.namespace(self._config.namespace) + .component(self._config.component) + .endpoint(self._config.endpoint) + ) + + try: + await register_llm( + ModelType.Backend, + endpoint, + self._config.model_path, + self._config.served_model_name, + kv_cache_block_size=self._config.kv_block_size, + ) + logger.info("Successfully registered LLM for discovery") + except Exception as e: + logger.error(f"Failed to register LLM for discovery: {e}") + raise + + logger.info("TensorRT-LLM Worker initialized") + + @on_shutdown + async def async_cleanup(self): + logger.info("Cleaning up TensorRT-LLM Worker") + await self.cleanup() + logger.info("TensorRT-LLM Worker cleanup completed") + + @endpoint() + async def generate(self, request: TRTLLMWorkerRequest): + async for response in super().generate(request): + yield response diff --git a/examples/tensorrt_llm_sd/configs/agg.yaml b/examples/tensorrt_llm_sd/configs/agg.yaml new file mode 100644 index 0000000000..a3d4594ed8 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/agg.yaml @@ -0,0 +1,34 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/agg_router.yaml b/examples/tensorrt_llm_sd/configs/agg_router.yaml new file mode 100644 index 0000000000..58f2a82ab3 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/agg_router.yaml @@ -0,0 +1,34 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: kv + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/agg_config.yaml" + router: kv + ServiceArgs: + workers: 1 + resources: + gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml new file mode 100644 index 0000000000..f7cec35e7d --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml @@ -0,0 +1,35 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + # This is the client-facing model name, you can set this to anything you'd like. + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + extra-engine-args: "configs/deepseek_r1/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml new file mode 100644 index 0000000000..9d96befbe5 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml @@ -0,0 +1,49 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + # This is the client-facing model name, you can set this to anything you'd like. + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + extra-engine-args: "configs/deepseek_r1/engine_configs/decode_config.yaml" + enable-disagg: true + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 + +TensorRTLLMPrefillWorker: + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. 
+ model-path: "nvidia/DeepSeek-R1-FP4" + extra-engine-args: "configs/deepseek_r1/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..29dddba56f --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml @@ -0,0 +1,54 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# TP/EP/PP/DP +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +pipeline_parallel_size: 1 +enable_attention_dp: false + +max_batch_size: 256 +# 8448 = 8192 ISL + 256 OSL +max_num_tokens: 8448 +max_seq_len: 8448 + +kv_cache_config: + # With dp attention disabled: high free_gpu_memory_fraction is fine. + free_gpu_memory_fraction: 0.85 + # With dp attention enabled: large ISL at high concurrency may need + # free_gpu_memory_fraction low to have enough available memory. + # free_gpu_memory_fraction: 0.30 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +use_cuda_graph: true +cuda_graph_padding_enabled: true +# NOTE: For larger max batch size, you may want to add larger cuda graph +# batch sizes below to match. +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..772b94b283 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml @@ -0,0 +1,55 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+backend: pytorch + +# TP/EP/PP/DP +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +pipeline_parallel_size: 1 +enable_attention_dp: false + +max_batch_size: 256 +max_num_tokens: 256 +# 8448 = 8192 ISL + 256 OSL +max_seq_len: 8448 + +kv_cache_config: + # With dp attention disabled: high free_gpu_memory_fraction is fine. + free_gpu_memory_fraction: 0.85 + # With dp attention enabled: large ISL at high concurrency may need + # free_gpu_memory_fraction low to have enough available memory. + # free_gpu_memory_fraction: 0.30 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: false +use_cuda_graph: true +cuda_graph_padding_enabled: true +# NOTE: For larger max batch size, you may want to add larger cuda graph +# batch sizes below to match. +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..6ae899a68a --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml @@ -0,0 +1,37 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# TP/EP/PP/DP +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +pipeline_parallel_size: 1 +enable_attention_dp: true + +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 + +kv_cache_config: + free_gpu_memory_fraction: 0.75 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: true +print_iter_log: true +# NOTE: This dtype must match in both prefill/decode configs +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..f0b5411221 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml @@ -0,0 +1,50 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +enable_attention_dp: true +max_batch_size: 256 +# 8448 = 8192 ISL + 256 OSL +max_num_tokens: 8448 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.30 + +# Enable MTP (Multi-Token Prediction) in the model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..ab48b2e78b --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml @@ -0,0 +1,53 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +enable_attention_dp: false +max_batch_size: 256 +# Note: When MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: +# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) +# This is a known issue in TensorRT-LLM and will be resolved in the next release.
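+# For example, with the values used below (max(cuda_graph_batch_sizes) = 256 and
+# num_nextn_predict_layers = 1), max_num_tokens must be at least 256 * (1 + 1) = 512.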
+max_num_tokens: 512 +# 8704 = 8192 ISL + 512 OSL +max_seq_len: 8704 +kv_cache_config: + free_gpu_memory_fraction: 0.85 + +# Enable the MTP(Multi-Token Prediction) in decode model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..ee6ee26a94 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml @@ -0,0 +1,37 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +enable_attention_dp: true +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 +kv_cache_config: + free_gpu_memory_fraction: 0.75 +print_iter_log: true +kv_cache_dtype: fp8 +disable_overlap_scheduler: true + +# Enable the MTP(Multi-Token Prediction) in the prefill model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml new file mode 100644 index 0000000000..c51abf9d95 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml @@ -0,0 +1,36 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. 
+ model-path: "nvidia/DeepSeek-R1-FP4" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml new file mode 100644 index 0000000000..5fe2679809 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml @@ -0,0 +1,52 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/decode_config.yaml" + router: round-robin + enable-disagg: true + ServiceArgs: + workers: 1 + resources: + gpu: 4 + +TensorRTLLMPrefillWorker: + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md new file mode 100644 index 0000000000..342cd45129 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md @@ -0,0 +1,275 @@ + + +# Example: Multi-node TRTLLM Workers with Dynamo on Slurm + +To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16), +the set of nodes need to be launched together in the same MPI world, such as +via `mpirun` or `srun`. This is true regardless of whether the worker is +aggregated, prefill-only, or decode-only. 
+ +In this document we will demonstrate two examples launching multinode workers +on a slurm cluster with `srun`: +1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16 + worker across 4 GB200 nodes +2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node + TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode + worker (4 nodes) across a total of 8 GB200 nodes. + +NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and +`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or +using `mpirun` directly, with relative ease. + +## Setup + +For simplicity of the example, we will make some assumptions about your slurm cluster: +1. First, we assume you have access to a slurm cluster with multiple GPU nodes + available. For functional testing, most setups should be fine. For performance + testing, you should aim to allocate groups of nodes that are performantly + inter-connected, such as those in an NVL72 setup. +2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis) + SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this + example will use `srun` arguments like `--container-image`, + `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis. + If your cluster supports similar container based plugins, you may be able to + modify the script to use that instead. +3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as + described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker). + This is the image that can be set to the `IMAGE` environment variable in later steps. +4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We + will allocate 8 nodes below as a reference command to have enough capacity + to run both examples. If you plan to only run the aggregated example, you + will only need 4 nodes. If you customize the configurations to require a + different number of nodes, you can adjust the number of allocated nodes + accordingly. Pre-allocating nodes is technically not a requirement, + but it makes iterations of testing/experimenting easier. + + Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup: + ```bash + # Set partition manually based on your slurm cluster's partition names + PARTITION="" + # Set account manually if this command doesn't work on your cluster + ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" + salloc \ + --partition="${PARTITION}" \ + --account="${ACCOUNT}" \ + --job-name="${ACCOUNT}-dynamo.trtllm" \ + -t 05:00:00 \ + --nodes 8 + ``` +5. Lastly, we will assume you are inside an interactive shell on one of your allocated + nodes, which may be the default behavior after executing the `salloc` command above + depending on the cluster setup. If not, then you should SSH into one of the allocated nodes. + +### Environment Variable Setup + +This example aims to automate as much of the environment setup as possible, +but all slurm clusters and environments are different, and you may need to +dive into the scripts to make modifications based on your specific environment. 
+ +Assuming you have already allocated your nodes via `salloc`, and are +inside an interactive shell on one of the allocated nodes, set the +following environment variables based on your setup: +```bash +# NOTE: IMAGE must be set manually for now +# To build an image, see the steps here: +# https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker +export IMAGE="" + +# MOUNTS are the host:container path pairs that are mounted into the containers +# launched by each `srun` command. +# +# If you want to reference files, such as $MODEL_PATH below, in a +# different location, you can customize MOUNTS or specify additional +# comma-separated mount pairs here. +# +# NOTE: Currently, this example assumes that the local bash scripts and configs +# referenced are mounted into /mnt inside the container. If you want to +# customize the location of the scripts, make sure to modify `srun_aggregated.sh` +# accordingly for the new locations of `start_frontend_services.sh` and +# `start_trtllm_worker.sh`. +# +# For example, assuming your cluster had a `/lustre` directory on the host, you +# could add that as a mount like so: +# +# export MOUNTS="${PWD}:/mnt,/lustre:/lustre" +export MOUNTS="${PWD}:/mnt" + +# NOTE: In general, Deepseek R1 is very large, so it is recommended to +# pre-download the model weights and save them in some shared location, +# NFS storage, HF_CACHE, etc. and modify the `--model-path` below +# to reuse the pre-downloaded weights instead. +# +# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights: +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# +# On Hopper systems, FP4 isn't supported so you'll need to use the default weights: +# https://huggingface.co/deepseek-ai/DeepSeek-R1 +export MODEL_PATH="nvidia/DeepSeek-R1-FP4" + +# The name the model will be served/queried under, matching what's +# returned by the /v1/models endpoint. +# +# By default this is inferred from MODEL_PATH, but when using locally downloaded +# model weights, it can be nice to have explicit control over the name. +export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" +``` + +## Aggregated WideEP + +Assuming you have at least 4 nodes allocated following the setup steps above, +follow the steps below to launch an **aggregated** deployment across 4 nodes: + +```bash +# Default set in srun_aggregated.sh, but can customize here. +# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml" + +# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG +# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of +# total GPUs necessary to satisfy the requested parallelism. For example, +# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16. +# export NUM_NODES=4 + +# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. +# export NUM_GPUS_PER_NODE=4 + +# Launches: +# - frontend + etcd/nats on current (head) node +# - one large aggregated trtllm worker across multiple nodes via MPI tasks +./srun_aggregated.sh +``` + +## Disaggregated WideEP + +Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) +following the setup above, follow the steps below to launch a **disaggregated** +deployment across 8 nodes: + +> [!Tip] +> Make sure you have a fresh environment and don't still have the aggregated +> example above deployed on the same set of nodes. + +```bash +# Defaults set in srun_disaggregated.sh, but can customize here.
+# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml" +# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml" + +# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG +# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG +# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and +# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of +# GPUs necessary to satisfy the requested parallelism in each config. +# export NUM_PREFILL_NODES=4 +# export NUM_DECODE_NODES=4 + +# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. +# export NUM_GPUS_PER_NODE=4 + +# Launches: +# - frontend + etcd/nats on current (head) node. +# - one large prefill trtllm worker across multiple nodes via MPI tasks +# - one large decode trtllm worker across multiple nodes via MPI tasks +./srun_disaggregated.sh +``` + +## Understanding the Output + +1. The `srun_aggregated.sh` script launches two `srun` jobs. The first launches + etcd, NATS, and the OpenAI frontend on the head node only, called "node1" + in the example output below. The second launches + a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each + using 4 GPUs. + ``` + # Frontend/etcd/nats services + srun: launching StepId=453374.17 on host node1, 1 tasks: 0 + ... + # TP16 TRTLLM worker split across 4 nodes with 4 gpus each + srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3] + srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7] + srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11] + srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15] + ``` +2. The OpenAI frontend will listen for and dynamically discover workers as + they register themselves with Dynamo's distributed runtime: + ``` + 0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models + 0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000" + ``` +3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each + GPU on each node, which will each output their progress while loading the model. + You can see each rank's output prefixed with the rank at the start of each log line + until the model successfully finishes loading: + ``` + 8: rank8 run mgmn worker node with mpi_world_size: 16 ... + 10: rank10 run mgmn worker node with mpi_world_size: 16 ... + 9: rank9 run mgmn worker node with mpi_world_size: 16 ... + 11: rank11 run mgmn worker node with mpi_world_size: 16 ... + ... + 15: Model init total -- 55.42s + 11: Model init total -- 55.91s + 12: Model init total -- 55.24s + ``` +4. After the model fully finishes loading on all ranks, the worker will register itself, + and the OpenAI frontend will detect it, signaled by this output: + ``` + 0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4" + ``` +5. At this point, with the worker fully initialized and detected by the frontend, + it is now ready for inference. +6. `srun_disaggregated.sh` follows a very similar flow, but launches + three `srun` jobs instead of two: one for the frontend, one for the prefill worker, + and one for the decode worker. + +## Example Request + +To verify the deployed model is working, send a `curl` request: +```bash +# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
+HOST=localhost +PORT=8000 +# "model" here should match the model name returned by the /v1/models endpoint +curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "'${SERVED_MODEL_NAME}'", + "messages": [ + { + "role": "user", + "content": "Tell me a story as if we were playing dungeons and dragons." + } + ], + "stream": true, + "max_tokens": 30 +}' +``` + +## Cleanup + +To cleanup background `srun` processes launched by `srun_aggregated.sh` or +`srun_disaggregated.sh`, you can run: +```bash +pkill srun +``` + +## Known Issues + +- This example has only been tested on a 4xGB200 node setup with 16 GPUs using + FP4 weights. In theory, the example should work on alternative setups such as + H100 nodes with FP8 weights, but this hasn't been tested yet. +- This example only tests an aggregated model setup for now. A disaggregated + serving example will be added in the near future. +- WideEP configs in this directory are still being tested. A WideEP specific + example with documentation will be added once ready. +- There are known issues where WideEP workers may not cleanly shut down: + - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For + now, you must manually clean these up before deploying again on the + same set of nodes. + - Similarly, there may be GPU memory left in-use after killing the `srun` + jobs. After cleaning up any leftover shared memory files as described + above, the GPU memory may slowly come back. You can run `watch nvidia-smi` + to check on this behavior. If you don't free the GPU memory before the + next deployment, you may get a CUDA OOM error while loading the model. + - There is mention of this issue in the relevant TRT-LLM blog + [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous). diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml new file mode 100644 index 0000000000..d697caacfa --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml @@ -0,0 +1,27 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Example of a Multi-node worker, but no WideEP or EPLB. +# See wide_ep*.yaml for WideEP example configs. +backend: pytorch +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 +enable_attention_dp: true +max_batch_size: 256 +max_num_tokens: 256 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.7 +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml new file mode 100644 index 0000000000..f2fe0a13c6 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml @@ -0,0 +1,7 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# moe_load_balancer settings for TRTLLM based on: +# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer +num_slots: 288 +layer_updates_per_iter: 2 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml new file mode 100644 index 0000000000..5bbc66bd69 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml @@ -0,0 +1,35 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +backend: pytorch + +# WideEP related settings +moe_backend: WideEP +# moe_max_num_tokens will default to max_num_tokens if left unspecified. +# +# If you want to set this value explicitly, one recommendation is below: +# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size +# 4096 = 256 * 16 +# moe_max_num_tokens: 4096 +moe_load_balancer: /mnt/engine_configs/eplb.yaml +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 + +enable_attention_dp: true +max_batch_size: 256 +max_num_tokens: 256 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.7 +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml new file mode 100644 index 0000000000..ac7fc7e8f6 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml @@ -0,0 +1,59 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# WideEP related settings +moe_backend: WideEP +moe_load_balancer: /mnt/engine_configs/eplb.yaml + +# TP/EP/PP/DP +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 +pipeline_parallel_size: 1 +enable_attention_dp: true + +max_batch_size: 256 +max_num_tokens: 256 +# 8448 = 8192 ISL + 256 OSL +max_seq_len: 8448 + +kv_cache_config: + # With dp attention disabled: high free_gpu_memory_fraction is fine. + # free_gpu_memory_fraction: 0.85 + # With dp attention enabled: large ISL at high concurrency may need + # free_gpu_memory_fraction low to have enough available memory. 
+ free_gpu_memory_fraction: 0.30 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: false +use_cuda_graph: true +cuda_graph_padding_enabled: true +# NOTE: For larger max batch size, you may want to add larger cuda graph +# batch sizes below to match. +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml new file mode 100644 index 0000000000..06968a3a76 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml @@ -0,0 +1,41 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# WideEP related settings +moe_backend: WideEP +moe_load_balancer: /mnt/engine_configs/eplb.yaml + +# TP/EP/PP/DP +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 +pipeline_parallel_size: 1 +enable_attention_dp: true + +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 + +kv_cache_config: + free_gpu_memory_fraction: 0.75 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: true +print_iter_log: true +# NOTE: This dtype must match in both prefill/decode configs +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh new file mode 100755 index 0000000000..5a632551b9 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh @@ -0,0 +1,75 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# This is one of the only variables that must be set currently, most of the rest may +# just work out of the box if following the steps in the README. +IMAGE="${IMAGE:-""}" + +# Set to mount current host directory to /mnt inside the container as an example, +# but you may freely customize the mounts based on your cluster. A common practice +# is to mount paths to NFS storage for common scripts, model weights, etc. 
+# NOTE: This can be a comma separated list of multiple mounts as well. +DEFAULT_MOUNT="${PWD}:/mnt" +MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" + +# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes. +# For 8xH100 nodes as an example, you may set this to 2 nodes x 8 gpus/node instead. +NUM_NODES=${NUM_NODES:-4} +NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} + +export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_agg.yaml}" + +# Automate settings of certain variables for convenience, but you are free +# to manually set these for more control as well. +ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" +export HEAD_NODE="${SLURMD_NODENAME}" +export HEAD_NODE_IP="$(hostname -i)" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" + +if [[ -z ${IMAGE} ]]; then + echo "ERROR: You need to set the IMAGE environment variable to the " \ + "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ + "See how to build one from source here: " \ + "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" + exit 1 +fi + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching frontend services in background." +srun \ + --overlap \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodelist "${HEAD_NODE}" \ + --nodes 1 \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_frontend_services.sh & + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching multi-node worker in background." +# No --task for the worker defaults to aggregated mode +TASK="" \ +srun \ + --mpi pmix \ + --oversubscribe \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodes "${NUM_NODES}" \ + --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh new file mode 100755 index 0000000000..32cb4993a9 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh @@ -0,0 +1,94 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# This is one of the only variables that must be set currently, most of the rest may +# just work out of the box if following the steps in the README. +IMAGE="${IMAGE:-""}" + +# Set to mount current host directory to /mnt inside the container as an example, +# but you may freely customize the mounts based on your cluster. A common practice +# is to mount paths to NFS storage for common scripts, model weights, etc. +# NOTE: This can be a comma separated list of multiple mounts as well. 
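+# For example, to also mount a shared filesystem (hypothetical host path):
+# MOUNTS="${PWD}:/mnt,/lustre:/lustre"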
+DEFAULT_MOUNT="${PWD}:/mnt" +MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" + +NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} + +NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4} +PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_prefill.yaml}" + +NUM_DECODE_NODES=${NUM_DECODE_NODES:-4} +DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_decode.yaml}" + +# Automate settings of certain variables for convenience, but you are free +# to manually set these for more control as well. +ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" +export HEAD_NODE="${SLURMD_NODENAME}" +export HEAD_NODE_IP="$(hostname -i)" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" + +if [[ -z ${IMAGE} ]]; then + echo "ERROR: You need to set the IMAGE environment variable to the " \ + "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ + "See how to build one from source here: " \ + "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" + exit 1 +fi + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching frontend services in background." +srun \ + --overlap \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodelist "${HEAD_NODE}" \ + --nodes 1 \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_frontend_services.sh & + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching multi-node prefill worker in background." +TASK=prefill \ +ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \ +srun \ + --mpi pmix \ + --oversubscribe \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodes "${NUM_PREFILL_NODES}" \ + --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_trtllm_worker.sh & + +echo "Launching multi-node decode worker in background." +TASK=decode \ +ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \ +srun \ + --mpi pmix \ + --oversubscribe \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodes "${NUM_DECODE_NODES}" \ + --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh new file mode 100755 index 0000000000..0d1b588904 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh @@ -0,0 +1,16 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# Start NATS +nats-server -js & + +# Start etcd +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & + +# Wait for NATS/etcd to startup +sleep 3 + +# Start OpenAI Frontend which will dynamically discover workers when they startup +# NOTE: This is a blocking call. +dynamo-run in=http out=dyn --http-port 8000 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh new file mode 100755 index 0000000000..257b3b1127 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +if [[ -z ${MODEL_PATH} ]]; then + echo "ERROR: MODEL_PATH was not set." + echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \ + "downloaded path to the model weights. Since Deepseek R1 is large, it is " \ + "recommended to pre-download them to a shared location and provide the path." + exit 1 +fi + +if [[ -z ${SERVED_MODEL_NAME} ]]; then + echo "WARNING: SERVED_MODEL_NAME was not set. It will be derived from MODEL_PATH." +fi + + + +if [[ -z ${ENGINE_CONFIG} ]]; then + echo "ERROR: ENGINE_CONFIG was not set." + echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file." + exit 1 +fi + +EXTRA_ARGS="" +if [[ -n ${TASK} ]]; then + EXTRA_ARGS+="--task ${TASK}" +fi + +# NOTE: When this script is run directly from srun, the environment variables +# for TRTLLM KV cache are not set. So we need to set them here. +# Related issue: https://github.com/ai-dynamo/dynamo/issues/1743 +if [[ -z ${TRTLLM_USE_UCX_KVCACHE} ]] && [[ -z ${TRTLLM_USE_NIXL_KVCACHE} ]]; then + export TRTLLM_USE_UCX_KVCACHE=1 +fi + +# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM +# worker and registers itself with the runtime. It is currently easier to wrap +# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes, +# but this may be refactored into 'dynamo serve' in the future. +trtllm-llmapi-launch \ + python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \ + --model-path "${MODEL_PATH}" \ + --model-name "${SERVED_MODEL_NAME}" \ + --extra-engine-args "${ENGINE_CONFIG}" \ + ${EXTRA_ARGS} diff --git a/examples/tensorrt_llm_sd/configs/disagg.yaml b/examples/tensorrt_llm_sd/configs/disagg.yaml new file mode 100644 index 0000000000..454e1640e6 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/disagg.yaml @@ -0,0 +1,48 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
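+
+# Disaggregated serving example: TensorRTLLMWorker handles decode (enable-disagg: true)
+# and TensorRTLLMPrefillWorker handles prefill, each pointing at its own engine config
+# under configs/engine_configs/.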
+ +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/decode_config.yaml" + enable-disagg: true + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 + +TensorRTLLMPrefillWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 + diff --git a/examples/tensorrt_llm_sd/configs/disagg_router.yaml b/examples/tensorrt_llm_sd/configs/disagg_router.yaml new file mode 100644 index 0000000000..faae7f65a3 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/disagg_router.yaml @@ -0,0 +1,47 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: kv + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/decode_config.yaml" + enable-disagg: true + router: kv + ServiceArgs: + workers: 1 + resources: + gpu: 1 + +TensorRTLLMPrefillWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. 
+ extra-engine-args: "configs/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..02b5cd8463 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml @@ -0,0 +1,31 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +enable_attention_dp: false +max_num_tokens: 8192 +max_batch_size: 16 +trust_remote_code: true +backend: pytorch +enable_chunked_prefill: true + +kv_cache_config: + free_gpu_memory_fraction: 0.95 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +use_cuda_graph: true diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..eb943fd6e7 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml @@ -0,0 +1,27 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +enable_attention_dp: false +max_num_tokens: 8192 +max_batch_size: 16 +trust_remote_code: true +backend: pytorch +enable_chunked_prefill: true +disable_overlap_scheduler: false +use_cuda_graph: true +kv_cache_config: + free_gpu_memory_fraction: 0.95 + diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..5dee9e653d --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml @@ -0,0 +1,28 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +enable_attention_dp: false +max_num_tokens: 8192 +max_batch_size: 16 +trust_remote_code: true +backend: pytorch +enable_chunked_prefill: true +# Overlap scheduler not currently supported in prefill only workers. +disable_overlap_scheduler: true +use_cuda_graph: false + +kv_cache_config: + free_gpu_memory_fraction: 0.95 diff --git a/examples/tensorrt_llm_sd/graphs/agg.py b/examples/tensorrt_llm_sd/graphs/agg.py new file mode 100644 index 0000000000..e79f5f315c --- /dev/null +++ b/examples/tensorrt_llm_sd/graphs/agg.py @@ -0,0 +1,19 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from components.frontend import Frontend +from components.worker import TensorRTLLMWorker + +Frontend.link(TensorRTLLMWorker) diff --git a/examples/tensorrt_llm_sd/graphs/disagg.py b/examples/tensorrt_llm_sd/graphs/disagg.py new file mode 100644 index 0000000000..58bde05d9a --- /dev/null +++ b/examples/tensorrt_llm_sd/graphs/disagg.py @@ -0,0 +1,20 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
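+
+# Disaggregated serving graph: Frontend -> TensorRTLLMWorker (decode) -> TensorRTLLMPrefillWorker (prefill).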
+ +from components.frontend import Frontend +from components.prefill_worker import TensorRTLLMPrefillWorker +from components.worker import TensorRTLLMWorker + +Frontend.link(TensorRTLLMWorker).link(TensorRTLLMPrefillWorker) From bc39d0e6d7bb06070a14243cac8dc3a46e547e31 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 10:12:35 -0700 Subject: [PATCH 03/20] llama4+eagle3 configuration --- .../configs/{deepseek_r1 => llama4_eagle3}/agg.yaml | 12 ++++-------- .../{deepseek_r1 => llama4_eagle3}/disagg.yaml | 0 .../engine_configs/agg_config.yaml | 0 .../engine_configs/decode_config.yaml | 0 .../engine_configs/prefill_config.yaml | 0 .../mtp/engine_configs/agg_config.yaml | 0 .../mtp/engine_configs/decode_config.yaml | 0 .../mtp/engine_configs/prefill_config.yaml | 0 .../{deepseek_r1 => llama4_eagle3}/mtp/mtp_agg.yaml | 12 ++++-------- .../mtp/mtp_disagg.yaml | 0 .../multinode/README.md | 0 .../multinode/engine_configs/dep16_agg.yaml | 0 .../multinode/engine_configs/eplb.yaml | 0 .../multinode/engine_configs/wide_ep_agg.yaml | 0 .../multinode/engine_configs/wide_ep_decode.yaml | 0 .../multinode/engine_configs/wide_ep_prefill.yaml | 0 .../multinode/srun_aggregated.sh | 0 .../multinode/srun_disaggregated.sh | 0 .../multinode/start_frontend_services.sh | 0 .../multinode/start_trtllm_worker.sh | 0 20 files changed, 8 insertions(+), 16 deletions(-) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/agg.yaml (69%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/disagg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/engine_configs/agg_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/engine_configs/decode_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/engine_configs/prefill_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/engine_configs/agg_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/engine_configs/decode_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/engine_configs/prefill_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/mtp_agg.yaml (71%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/mtp_disagg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/README.md (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/dep16_agg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/eplb.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/wide_ep_agg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/wide_ep_decode.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/wide_ep_prefill.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/srun_aggregated.sh (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/srun_disaggregated.sh (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/start_frontend_services.sh (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => 
llama4_eagle3}/multinode/start_trtllm_worker.sh (100%) diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml similarity index 69% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml index f7cec35e7d..3ac3facedd 100644 --- a/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml +++ b/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml @@ -15,19 +15,15 @@ Frontend: # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/DeepSeek-R1-FP4" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - extra-engine-args: "configs/deepseek_r1/engine_configs/agg_config.yaml" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + extra-engine-args: "configs/llama4_eagle3/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: workers: 1 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml rename to 
examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml similarity index 71% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml index c51abf9d95..626ca27953 100644 --- a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml +++ b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml @@ -14,21 +14,17 @@ # limitations under the License. Frontend: - served_model_name: "nvidia/DeepSeek-R1-FP4" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. 
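With the `llama4_eagle3` config above, the Frontend listens on port 8000 and requests must name the configured `served_model_name`. The following is a hypothetical smoke test; it assumes the frontend exposes an OpenAI-compatible `/v1/chat/completions` route, so adjust the path and payload to match your deployment.

```python
# Hypothetical smoke test against the Frontend on port 8000.
import requests

payload = {
    "model": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",  # must match served_model_name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32,
    "stream": False,
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, timeout=120
)
resp.raise_for_status()
print(resp.json())
```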
- extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/agg_config.yaml" + extra-engine-args: "configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: workers: 1 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh rename to 
examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh From 4a0b8c64b00a6d5882324bdbeda3cfc52fc4ab14 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 11:59:20 -0700 Subject: [PATCH 04/20] Test --- examples/tensorrt_llm_sd/configs/disagg.yaml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/tensorrt_llm_sd/configs/disagg.yaml b/examples/tensorrt_llm_sd/configs/disagg.yaml index 454e1640e6..73c202f4be 100644 --- a/examples/tensorrt_llm_sd/configs/disagg.yaml +++ b/examples/tensorrt_llm_sd/configs/disagg.yaml @@ -14,16 +14,16 @@ # limitations under the License. Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/engine_configs/decode_config.yaml" @@ -36,7 +36,7 @@ TensorRTLLMWorker: TensorRTLLMPrefillWorker: # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. 
extra-engine-args: "configs/engine_configs/prefill_config.yaml" From e889649d14cfe0fd447fcc3e9ab13a24d2286461 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 19:18:38 -0700 Subject: [PATCH 05/20] Modified Workflow --- .../eagle/engine_configs/agg_config.yaml | 50 +++++++++++++++++ .../eagle/engine_configs/decode_config.yaml | 53 +++++++++++++++++++ .../eagle/engine_configs/prefill_config.yaml | 37 +++++++++++++ .../configs/llama4/eagle/mtp_agg.yaml | 31 +++++++++++ .../configs/llama4/eagle/mtp_disagg.yaml | 52 ++++++++++++++++++ 5 files changed, 223 insertions(+) create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..633d630633 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -0,0 +1,50 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +# enable_attention_dp: true +max_batch_size: 256 +# 8448 = 8192 ISL + 256 OSL +max_num_tokens: 8448 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.30 + +# Enable the MTP(Multi-Token Prediction) in the model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..fed64bcb22 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -0,0 +1,53 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +# enable_attention_dp: false +max_batch_size: 256 +# Note: When MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: +# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) +# This is a known issue in TensorRT-LLM and will be resolved in the next release. +max_num_tokens: 512 +# 8704 = 8192 ISL + 512 OSL +max_seq_len: 8704 +kv_cache_config: + free_gpu_memory_fraction: 0.85 + +# Enable the MTP(Multi-Token Prediction) in decode model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..6dd4bca5ed --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -0,0 +1,37 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +# enable_attention_dp: true +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 +kv_cache_config: + free_gpu_memory_fraction: 0.75 +print_iter_log: true +kv_cache_dtype: fp8 +disable_overlap_scheduler: true + +# Enable the MTP(Multi-Token Prediction) in the prefill model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml new file mode 100644 index 0000000000..6a64336101 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml @@ -0,0 +1,31 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
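The decode engine config above notes that with MTP enabled, `max_num_tokens` must be at least `max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)`. A quick sanity check for the values used in that file (256 * 2 = 512):

```python
# Check of the constraint called out in decode_config.yaml above.
cuda_graph_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
num_nextn_predict_layers = 1
max_num_tokens = 512

required = max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)  # 256 * 2 = 512
assert max_num_tokens >= required, (
    f"max_num_tokens={max_num_tokens} must be >= {required} when MTP is enabled"
)
print(f"OK: max_num_tokens={max_num_tokens} satisfies the minimum of {required}")
```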
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + # This is the client-facing model name, you can set this to anything you'd like. + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml new file mode 100644 index 0000000000..72d3ce6f29 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml @@ -0,0 +1,52 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml" + router: round-robin + enable-disagg: true + ServiceArgs: + workers: 1 + resources: + gpu: 4 + +TensorRTLLMPrefillWorker: + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
+ extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 From 7b4c32a00fe94b4b8637fa916d12f9027ed97836 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 19:26:50 -0700 Subject: [PATCH 06/20] Streamlined Example --- examples/tensorrt_llm_sd/README.md | 352 ---------------- examples/tensorrt_llm_sd/__init__.py | 14 - examples/tensorrt_llm_sd/common/__init__.py | 0 .../tensorrt_llm_sd/common/base_engine.py | 389 ------------------ examples/tensorrt_llm_sd/common/parser.py | 62 --- examples/tensorrt_llm_sd/common/protocol.py | 104 ----- .../tensorrt_llm_sd/components/frontend.py | 119 ------ .../components/prefill_worker.py | 75 ---- examples/tensorrt_llm_sd/components/worker.py | 115 ------ examples/tensorrt_llm_sd/configs/agg.yaml | 34 -- .../tensorrt_llm_sd/configs/agg_router.yaml | 34 -- examples/tensorrt_llm_sd/configs/disagg.yaml | 48 --- .../configs/disagg_router.yaml | 47 --- .../configs/engine_configs/agg_config.yaml | 31 -- .../configs/engine_configs/decode_config.yaml | 27 -- .../engine_configs/prefill_config.yaml | 28 -- .../configs/llama4_eagle3/agg.yaml | 31 -- .../configs/llama4_eagle3/disagg.yaml | 49 --- .../engine_configs/agg_config.yaml | 54 --- .../engine_configs/decode_config.yaml | 55 --- .../engine_configs/prefill_config.yaml | 37 -- .../mtp/engine_configs/agg_config.yaml | 50 --- .../mtp/engine_configs/decode_config.yaml | 53 --- .../mtp/engine_configs/prefill_config.yaml | 37 -- .../configs/llama4_eagle3/mtp/mtp_agg.yaml | 32 -- .../configs/llama4_eagle3/mtp/mtp_disagg.yaml | 52 --- .../configs/llama4_eagle3/multinode/README.md | 275 ------------- .../multinode/engine_configs/dep16_agg.yaml | 27 -- .../multinode/engine_configs/eplb.yaml | 7 - .../multinode/engine_configs/wide_ep_agg.yaml | 35 -- .../engine_configs/wide_ep_decode.yaml | 59 --- .../engine_configs/wide_ep_prefill.yaml | 41 -- .../multinode/srun_aggregated.sh | 75 ---- .../multinode/srun_disaggregated.sh | 94 ----- .../multinode/start_frontend_services.sh | 16 - .../multinode/start_trtllm_worker.sh | 46 --- examples/tensorrt_llm_sd/graphs/agg.py | 19 - examples/tensorrt_llm_sd/graphs/disagg.py | 20 - 38 files changed, 2643 deletions(-) delete mode 100644 examples/tensorrt_llm_sd/README.md delete mode 100644 examples/tensorrt_llm_sd/__init__.py delete mode 100644 examples/tensorrt_llm_sd/common/__init__.py delete mode 100644 examples/tensorrt_llm_sd/common/base_engine.py delete mode 100644 examples/tensorrt_llm_sd/common/parser.py delete mode 100644 examples/tensorrt_llm_sd/common/protocol.py delete mode 100644 examples/tensorrt_llm_sd/components/frontend.py delete mode 100644 examples/tensorrt_llm_sd/components/prefill_worker.py delete mode 100644 examples/tensorrt_llm_sd/components/worker.py delete mode 100644 examples/tensorrt_llm_sd/configs/agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/agg_router.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/disagg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/disagg_router.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml delete mode 100644 
examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh delete mode 100644 examples/tensorrt_llm_sd/graphs/agg.py delete mode 100644 examples/tensorrt_llm_sd/graphs/disagg.py diff --git a/examples/tensorrt_llm_sd/README.md b/examples/tensorrt_llm_sd/README.md deleted file mode 100644 index f844a56d94..0000000000 --- a/examples/tensorrt_llm_sd/README.md +++ /dev/null @@ -1,352 +0,0 @@ - - -# LLM Deployment Examples using TensorRT-LLM - -This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM. - -## Use the Latest Release - -We recommend using the latest stable release of dynamo to avoid breaking changes: - -[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) - -You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: - -```bash -git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) -``` - -## Deployment Architectures - -See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. -Note that this TensorRT-LLM version does not support all the options yet. - -Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving. - -## Getting Started - -1. Choose a deployment architecture based on your requirements -2. Configure the components as needed -3. 
Deploy using the provided scripts - -### Prerequisites - -Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml) -```bash -docker compose -f deploy/metrics/docker-compose.yml up -d -``` - -### Build docker - -```bash -# TensorRT-LLM uses git-lfs, which needs to be installed in advance. -apt-get update && apt-get -y install git git-lfs - -# On an x86 machine: -./container/build.sh --framework tensorrtllm - -# On an ARM machine: -./container/build.sh --framework tensorrtllm --platform linux/arm64 - -# Build the container with the default experimental TensorRT-LLM commit -# WARNING: This is for experimental feature testing only. -# The container should not be used in a production environment. -./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit -``` - -### Run container - -``` -./container/run.sh --framework tensorrtllm -it -``` -## Run Deployment - -This figure shows an overview of the major components to deploy: - - - -``` - -+------+ +-----------+ +------------------+ +---------------+ -| HTTP |----->| processor |----->| Worker |------------>| Prefill | -| |<-----| |<-----| |<------------| Worker | -+------+ +-----------+ +------------------+ +---------------+ - | ^ | - query best | | return | publish kv events - worker | | worker_id v - | | +------------------+ - | +---------| kv-router | - +------------->| | - +------------------+ - -``` - -Note: The above architecture illustrates all the components. The final components -that get spawned depend upon the chosen graph. - -### Example architectures - -#### Aggregated serving -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml -``` - -#### Aggregated serving with KV Routing -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f ./configs/agg_router.yaml -``` - -#### Disaggregated serving -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml -``` - -#### Disaggregated serving with KV Routing -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.disagg:Frontend -f ./configs/disagg_router.yaml -``` - -#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1 -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml -``` - -Notes: -- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script. - - Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit` - -- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. -- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. - -#### Multi-Node Disaggregated Serving - -In the following example, we will demonstrate how to run a Disaggregated Serving -deployment across multiple nodes. For simplicity, we will demonstrate how to -deploy a single Decode worker on one node, and a single Prefill worker on the other node. 
-However, the instance counts, TP sizes, other configs, and responsibilities of each node -can be customized and deployed in similar ways. - -For example, to deploy Deepseek R1, you could replace the referenced example -configs (`configs/agg.yaml`, `configs/disagg.yaml`) with corresponding Deepseek R1 -example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`). -You can find the example Deepseek R1 configs for GB200 -[here](configs/deepseek_r1), but the config settings can be customized for testing -other hardware configurations or parallelism strategies. - -This "multi-node" example demonstrates how to generally connect dynamo workers from -different nodes, but for simplicity, each worker individually fits on a single node. -For details on how to launch a worker that spans multiple nodes due to sheer model -size, or for features like large scale expert parallelism, see the -[multinode worker example](configs/deepseek_r1/multinode). - -##### Head Node - -Start nats/etcd: -```bash -# NATS data persisted to /tmp/nats/jetstream by default -nats-server -js & - -# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified -etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & - -# NOTE: Clearing out the etcd and nats jetstream data directories across runs -# helps to guarantee a clean and reproducible results. -``` - -Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: - -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml & -``` - -Notes: -- The aggregated graph (`graphs.agg`) is chosen here because it also describes - our desired deployment settings for the head node: launching the utility components - (Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with - `remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker` - independently on a separate node in the next step of this demonstration. - You are free to customize the graph and configuration of components launched on - each node. -- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a - single source of truth to be used for deployments on all of our nodes, describing - the configurations for all of our components, including both decode and prefill - workers, but can be customized based on your deployment needs. - -##### Worker Node(s) - -Set environment variables pointing at the etcd/nats endpoints on the head node -so the Dynamo Distributed Runtime can orchestrate communication and -discoverability between the head node and worker nodes: -```bash -# if not head node -export HEAD_NODE_IP="" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -``` - -Deploy a Prefill worker: -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker & -``` - -Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node! 
- -##### Additional Notes for Multi-Node Deployments - -Notes: -- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`, - and change the config to one that includes the router, such as `configs/disagg_router.yaml` -- This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes. - Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step - remains mostly the same. The primary difference between aggregation and disaggregation for this step is - whether or not the `TensorRTLLMWorker` is configured to do `remote-prefill` or not in the config file - (ex: `configs/disagg.yaml` vs `configs/agg.yaml`). -- To apply the same concept for launching additional decode workers on worker nodes, you can - directly start them, similar to the prefill worker step above: - ```bash - # Example: deploy decode worker only - cd /workspace/examples/tensorrt_llm - dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker & - ``` -- If you see an error about MPI Spawn failing during TRTLLM Worker initialziation on a Slurm-based cluster, - try unsetting the following environment variables before launching the TRTLLM worker. If you intend to - run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may - want to save these values into temporary variables and then restore them afterwards. - ```bash - # Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes` - unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST - ``` - -#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1 - -Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations - -##### Head Node - -Start nats/etcd -```bash -nats-server -js & -etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & -``` - -Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: - -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml & -``` - -##### Worker Node(s) - -Set environment variables pointing at the etcd/nats endpoints on the head node. -```bash -export HEAD_NODE_IP="" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -``` - -Deploy a Prefill worker: -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker & -``` - -Notes: -- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script. - - Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit` -- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. -- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. 
Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. - - -### Client - -See [client](../llm/README.md#client) section to learn how to send request to the deployment. - -NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component. - -### Close deployment - -See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment. - -### Benchmarking - -To benchmark your deployment with GenAI-Perf, see this utility script, configuring the -`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh) - - -### KV Cache Transfer for Disaggregated Serving - -In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer: - -#### Default Method: UCX -By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers. - -#### Experimental Method: NIXL -TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments. - -**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet. - -#### Using NIXL for KV Cache Transfer - -**Note:** NIXL backend for TensorRT-LLM is currently only supported on AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer. - -To enable NIXL for KV cache transfer in disaggregated serving: - -1. **Build the container with NIXL support:** - The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support. - - **Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):** - ```bash - rm -rf /tmp/trtllm_wheel - ``` - - **Build the container with NIXL support:** - ```bash - ./container/build.sh --framework tensorrtllm \ - --use-default-experimental-tensorrtllm-commit \ - --trtllm-use-nixl-kvcache-experimental - ``` - - **Note:** Both `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support. - -2. **Run the containerized environment:** - See [run container](#run-container) section to learn how to start the container image built in previous step. - -3. **Start the disaggregated service:** - See [disaggregated serving](#disaggregated-serving) to see how to start the deployment. - -4. **Send the request:** - See [client](#client) section to learn how to send the request to deployment. - -**Important:** Ensure that ETCD and NATS services are running before starting the service. - -The container will automatically configure the appropriate environment variables (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can be used to use UCX for KV cache transfer. 
-```bash -unset TRTLLM_USE_NIXL_KVCACHE -export TRTLLM_USE_UCX_KVCACHE=1 -``` - diff --git a/examples/tensorrt_llm_sd/__init__.py b/examples/tensorrt_llm_sd/__init__.py deleted file mode 100644 index 3159bfe656..0000000000 --- a/examples/tensorrt_llm_sd/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/tensorrt_llm_sd/common/__init__.py b/examples/tensorrt_llm_sd/common/__init__.py deleted file mode 100644 index e69de29bb2..0000000000 diff --git a/examples/tensorrt_llm_sd/common/base_engine.py b/examples/tensorrt_llm_sd/common/base_engine.py deleted file mode 100644 index 3df95b490c..0000000000 --- a/examples/tensorrt_llm_sd/common/base_engine.py +++ /dev/null @@ -1,389 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import logging -from dataclasses import dataclass -from typing import Any, Optional - -from common.protocol import DisaggregatedTypeConverter, TRTLLMWorkerRequest -from tensorrt_llm import SamplingParams -from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_options -from tensorrt_llm.llmapi.tokenizer import tokenizer_factory -from tensorrt_llm.serve.openai_protocol import ( - DisaggregatedParams as OAIDisaggregatedParams, -) - -from dynamo.llm import get_tensorrtllm_engine, get_tensorrtllm_publisher -from dynamo.runtime import DistributedRuntime - -logger = logging.getLogger(__name__) - -logger.setLevel(logging.DEBUG) - -# Default buffer size for kv cache events. -DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024 - - -def parse_endpoint(endpoint: str) -> tuple[str, str, str]: - endpoint_str = endpoint.replace("dyn://", "", 1) - endpoint_parts = endpoint_str.split(".") - if len(endpoint_parts) != 3: - raise ValueError( - f"Invalid endpoint format: '{endpoint}'. " - "Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'." 
- ) - - return (endpoint_parts[0], endpoint_parts[1], endpoint_parts[2]) - - -@dataclass -class BaseEngineConfig: - """Base engine configuration""" - - namespace: str - component: str - endpoint: str - model_path: str - served_model_name: Optional[str] = None - kv_block_size: int = 32 - extra_engine_args: str = "" - publish_events_and_metrics: bool = False - disaggregation_mode: str = "prefill_and_decode" - remote_prefill_endpoint: Optional[str] = None - lease_id: int = 0 - - def __str__(self) -> str: - return ( - f"Config(namespace={self.namespace}, " - f"component={self.component}, " - f"endpoint={self.endpoint}, " - f"model_path={self.model_path}, " - f"served_model_name={self.served_model_name}, " - f"kv_block_size={self.kv_block_size}, " - f"extra_engine_args={self.extra_engine_args}, " - f"publish_events_and_metrics={self.publish_events_and_metrics}, " - f"disaggregation_mode={self.disaggregation_mode}, " - f"remote_prefill_endpoint={self.remote_prefill_endpoint}, " - f"lease_id={self.lease_id})" - ) - - -class BaseTensorrtLLMEngine: - def __init__( - self, - config: BaseEngineConfig, - ): - self._config = config - self._prefill_client = None - self._llm_engine = None - self._llm_engine_context = None - self._llm_publisher = None - self._llm_publisher_context = None - self._runtime = None - self._first_generation = True - # Initialize default sampling params - self.default_sampling_params = SamplingParams() - - async def initialize(self, runtime: DistributedRuntime): - """Initialize the engine and prefill client if needed""" - self._runtime = runtime - - # Convert model path to Path object if it's a local path, otherwise keep as string - model_path = str(self._config.model_path) - - # Initialize the LLM engine - engine_args: dict[str, Any] = { - "model": model_path, - "tensor_parallel_size": 1, - "backend": "pytorch", - "skip_tokenizer_init": True, - } - - if self._config.extra_engine_args: - # TODO: Support extra engine args from json file as well. - engine_args = update_llm_args_with_extra_options( - engine_args, self._config.extra_engine_args - ) - # Update the model path in the config to the model path used by the engine. - self._config.model_path = str(engine_args["model"]) - if not self._config.model_path: - raise ValueError( - "Model specification is required. Present neither in the config nor in the extra engine args." - ) - - # Populate default sampling params from the model - tokenizer = tokenizer_factory(self._config.model_path) - self.default_sampling_params = SamplingParams() - self.default_sampling_params._setup(tokenizer) - self.default_sampling_params.stop = None - - if self._config.publish_events_and_metrics: - # 'event_buffer_max_size' is required to enable TRTLLM to publish kv cache events. - kv_cache_config: dict[str, Any] | Any = None - if "kv_cache_config" not in engine_args: - kv_cache_config = {} - kv_cache_config[ - "event_buffer_max_size" - ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE - else: - kv_cache_config = engine_args["kv_cache_config"] - if ( - hasattr(kv_cache_config, "event_buffer_max_size") - and not kv_cache_config.event_buffer_max_size - ): - kv_cache_config.event_buffer_max_size = ( - DEFAULT_KV_EVENT_BUFFER_MAX_SIZE - ) - elif ( - isinstance(kv_cache_config, dict) - and "event_buffer_max_size" not in kv_cache_config - ): - kv_cache_config[ - "event_buffer_max_size" - ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE - engine_args["kv_cache_config"] = kv_cache_config - - # Enable iter perf stats by default if we are publishing events and metrics. 
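The removed `parse_endpoint` helper above accepts endpoints with or without the `dyn://` prefix and splits them into namespace, component, and endpoint. A self-contained illustration of that behavior, with the logic copied from the helper:

```python
# Illustration of parse_endpoint: "dyn://namespace.component.endpoint" and
# "namespace.component.endpoint" both resolve to the same three parts.
def parse_endpoint(endpoint: str) -> tuple[str, str, str]:
    endpoint_str = endpoint.replace("dyn://", "", 1)
    parts = endpoint_str.split(".")
    if len(parts) != 3:
        raise ValueError(f"Invalid endpoint format: '{endpoint}'")
    return (parts[0], parts[1], parts[2])


assert parse_endpoint("dyn://dynamo.TensorRTLLMPrefillWorker.generate") == (
    "dynamo",
    "TensorRTLLMPrefillWorker",
    "generate",
)
assert parse_endpoint("dynamo.TensorRTLLMWorker.generate") == (
    "dynamo",
    "TensorRTLLMWorker",
    "generate",
)
print("endpoint parsing behaves as expected")
```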
- if not engine_args.get("enable_iter_perf_stats"): - engine_args["enable_iter_perf_stats"] = True - - # Only pytorch backend is supported for now to publish events and metrics. - if engine_args.get("backend") != "pytorch": - logging.error( - "Only pytorch backend is supported for now to publish events and metrics." - ) - raise RuntimeError( - "Only pytorch backend is supported for now to publish events and metrics. Hence, KV router is not supported." - ) - - logging.info(f"TRTLLM engine args: {engine_args}") - - # Get the engine using the asynccontextmanager - self._llm_engine_context = get_tensorrtllm_engine(engine_args) - if self._llm_engine_context is not None: - self._llm_engine = await self._llm_engine_context.__aenter__() - else: - raise RuntimeError("Failed to create LLM engine context") - - if ( - self._config.publish_events_and_metrics - and self._config.disaggregation_mode != "prefill" - ): - kv_listener = runtime.namespace(self._config.namespace).component( - self._config.component - ) - self._llm_publisher_context = get_tensorrtllm_publisher( - kv_listener, - self._llm_engine, - kv_listener, - self._config.lease_id, - self._config.kv_block_size, - ) - if self._llm_publisher_context is not None: - self._llm_publisher = await self._llm_publisher_context.__aenter__() - else: - raise RuntimeError("Failed to create LLM publisher context") - - # Initialize prefill client if in decode mode - if self._config.disaggregation_mode == "decode": - if self._config.remote_prefill_endpoint is None: - raise ValueError("remote_prefill_endpoint is required for decode mode") - logging.info( - f"Initializing remote prefill client for endpoint: {self._config.remote_prefill_endpoint}" - ) - ( - parsed_namespace, - parsed_component_name, - parsed_endpoint_name, - ) = parse_endpoint(self._config.remote_prefill_endpoint) - if self._runtime is not None: - self._prefill_client = ( - await self._runtime.namespace(parsed_namespace) - .component(parsed_component_name) - .endpoint(parsed_endpoint_name) - .client() - ) - else: - raise RuntimeError("Runtime not initialized") - - async def cleanup(self): - """Cleanup resources""" - if self._llm_publisher_context: - try: - await self._llm_publisher_context.__aexit__(None, None, None) - except Exception as e: - logging.error(f"Error during publisher cleanup: {e}") - finally: - self._llm_publisher = None - self._llm_publisher_context = None - - if self._llm_engine_context: - try: - await self._llm_engine_context.__aexit__(None, None, None) - except Exception as e: - logging.error(f"Error during engine cleanup: {e}") - finally: - self._llm_engine = None - self._llm_engine_context = None - - self._prefill_client = None - - async def remote_prefill(self, request: TRTLLMWorkerRequest): - """ - Send a prefill request to the remote prefill worker. - - Args: - request: The original request to be sent for prefill - - Returns: - The response from the remote prefill worker - - Raises: - ValueError: If prefill client is not initialized or multiple responses received - """ - prefill_request = request.model_copy(deep=True) - # TRTLLM requires max_tokens to be set for prefill requests. - prefill_request.stop_conditions.max_tokens = 1 - prefill_request.disaggregated_params = OAIDisaggregatedParams( - request_type="context_only" - ) - - if self._prefill_client is None: - raise ValueError("Prefill client not initialized") - try: - # TODO: Use smart KV router to determine which prefill worker to use. This would also require supporting publishing events for prefill workers. 
- remote_prefill_responses = [ - remote_prefill_response - async for remote_prefill_response in await self._prefill_client.round_robin( - prefill_request.model_dump_json() - ) - ] - except Exception as e: - raise ValueError(f"Error in remote prefill: {e}") - - if len(remote_prefill_responses) > 1: - raise ValueError( - "Prefill worker returned more than one response. This is currently not supported in remote prefill mode." - ) - - if len(remote_prefill_responses) == 0: - raise ValueError("No response received from remote prefill worker") - - remote_prefill_response = remote_prefill_responses[0] - return remote_prefill_response - - async def generate(self, request: TRTLLMWorkerRequest): - if self._llm_engine is None: - raise RuntimeError("Engine not initialized") - - if self._llm_publisher: - publishers_error = self._llm_publisher.check_error_queue() - if publishers_error: - raise publishers_error - - inputs = request.token_ids - - # Decode the disaggregated params from the request - disaggregated_params = DisaggregatedTypeConverter.to_llm_disaggregated_params( - request.disaggregated_params - ) - num_output_tokens_so_far = 0 - - if self._config.disaggregation_mode == "decode": - # Run prefill/context phase remotely if disaggregation mode is decode. - try: - prefill_result = await self.remote_prefill(request) - except Exception as e: - raise ValueError(f"Error in remote prefill: {e}") - - remote_prefill_response = prefill_result.data() - if ( - remote_prefill_response["finish_reason"] == "stop" - or remote_prefill_response["finish_reason"] == "error" - ): - yield remote_prefill_response - return - num_output_tokens_so_far = len(remote_prefill_response["token_ids"]) - - # Decode the disaggregated params from the remote prefill response - # Decode the disaggregated params from the remote prefill response - disaggregated_params = ( - DisaggregatedTypeConverter.to_llm_disaggregated_params( - OAIDisaggregatedParams( - **remote_prefill_response["disaggregated_params"] - ) - ) - ) - - # Send the first token response to the client - first_token_response = remote_prefill_response - first_token_response.pop("disaggregated_params") - yield first_token_response - - # Set the disaggregated params to generation_only for the rest of the generation - disaggregated_params.request_type = "generation_only" - - sampling_params = self.default_sampling_params - for key, value in request.sampling_options.model_dump().items(): - if not value: - continue - if hasattr(sampling_params, key): - setattr(sampling_params, key, value) - - max_tokens = request.stop_conditions.max_tokens - if max_tokens: - sampling_params.max_tokens = max_tokens - - ignore_eos = request.stop_conditions.ignore_eos - if ignore_eos: - sampling_params.ignore_eos = ignore_eos - - # TODO: Disable streaming for context only requests when adding disagg support - async for res in self._llm_engine.llm.generate_async( - inputs=inputs, - sampling_params=sampling_params, - disaggregated_params=disaggregated_params, - streaming=(self._config.disaggregation_mode != "prefill"), - ): - # TRTLLM engine needs to start generating tokens first before stats - # can be retrieved. 
- if self._first_generation and self._llm_publisher: - self._llm_publisher.start() - self._first_generation = False - - if res.finished and self._config.disaggregation_mode != "prefill": - yield {"finish_reason": "stop", "token_ids": []} - break - - if not res.outputs: - yield {"finish_reason": "error", "token_ids": []} - break - - output = res.outputs[0] - next_total_toks = len(output.token_ids) - out = {"token_ids": output.token_ids[num_output_tokens_so_far:]} - if output.finish_reason: - out["finish_reason"] = output.finish_reason - if output.stop_reason: - out["stop_reason"] = output.stop_reason - if self._config.disaggregation_mode == "prefill": - # Return the disaggregated params only when operating in prefill mode. - out[ - "disaggregated_params" - ] = DisaggregatedTypeConverter.to_oai_disaggregated_params( - output.disaggregated_params - ).model_dump() - - yield out - num_output_tokens_so_far = next_total_toks diff --git a/examples/tensorrt_llm_sd/common/parser.py b/examples/tensorrt_llm_sd/common/parser.py deleted file mode 100644 index 67bb230796..0000000000 --- a/examples/tensorrt_llm_sd/common/parser.py +++ /dev/null @@ -1,62 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse - - -def parse_tensorrt_llm_args( - config_args, -) -> argparse.Namespace: - parser = argparse.ArgumentParser(description="A TensorRT-LLM Worker parser") - parser.add_argument( - "--extra-engine-args", - type=str, - default="", - help="Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.", - ) - parser.add_argument( - "--model-path", - type=str, - default=None, - help="Path to disk model or HuggingFace model identifier to load.", - ) - parser.add_argument( - "--served_model_name", - type=str, - help="Name to serve the model under.", - ) - parser.add_argument( - "--router", - type=str, - choices=["random", "round-robin", "kv"], - default="random", - help="Router type to use for scheduling requests to workers", - ) - - parser.add_argument( - "--kv-block-size", - type=int, - default=32, - help="Number of tokens per KV block in TRTLLM worker. Default is 32 for pytorch backend.", - ) - - parser.add_argument( - "--enable-disagg", - action="store_true", - help="Enable remote prefill for the worker", - ) - - args = parser.parse_args(config_args) - return args diff --git a/examples/tensorrt_llm_sd/common/protocol.py b/examples/tensorrt_llm_sd/common/protocol.py deleted file mode 100644 index f05cdb9f8f..0000000000 --- a/examples/tensorrt_llm_sd/common/protocol.py +++ /dev/null @@ -1,104 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import base64 -from typing import List, Optional - -from pydantic import BaseModel, Field -from tensorrt_llm.llmapi import DisaggregatedParams as LlmDisaggregatedParams -from tensorrt_llm.serve.openai_protocol import DisaggregatedParams - - -class Tokens(BaseModel): - tokens: list[int] - - -TokenIdType = int - - -class DisaggregatedTypeConverter: - @staticmethod - def to_llm_disaggregated_params( - disaggregated_params: DisaggregatedParams, - ) -> LlmDisaggregatedParams: - if disaggregated_params is None: - return None - else: - opaque_state = ( - base64.b64decode(disaggregated_params.encoded_opaque_state) - if disaggregated_params.encoded_opaque_state is not None - else None - ) - - return LlmDisaggregatedParams( - request_type=disaggregated_params.request_type, - first_gen_tokens=disaggregated_params.first_gen_tokens, - ctx_request_id=disaggregated_params.ctx_request_id, - opaque_state=opaque_state, - ) - - @staticmethod - def to_oai_disaggregated_params( - tllm_disagg_params: LlmDisaggregatedParams, - ) -> DisaggregatedParams: - if tllm_disagg_params is None: - return None - else: - encoded_opaque_state = ( - base64.b64encode(tllm_disagg_params.opaque_state).decode("utf-8") - if tllm_disagg_params.opaque_state is not None - else None - ) - return DisaggregatedParams( - request_type=tllm_disagg_params.request_type, - first_gen_tokens=tllm_disagg_params.first_gen_tokens, - ctx_request_id=tllm_disagg_params.ctx_request_id, - encoded_opaque_state=encoded_opaque_state, - ) - - -# TODO: move these to common for all LLMs once we adopt dynamo-run -# derived from lib/llm/src/protocols/common/preprocessor.rs -class StopConditions(BaseModel): - max_tokens: Optional[int] = None - stop: Optional[List[str]] = None - stop_token_ids_hidden: Optional[List[TokenIdType]] = None - min_tokens: Optional[int] = None - ignore_eos: Optional[bool] = None - - -class SamplingOptions(BaseModel): - n: Optional[int] = None - best_of: Optional[int] = None - presence_penalty: Optional[float] = None - frequency_penalty: Optional[float] = None - repetition_penalty: Optional[float] = None - temperature: Optional[float] = None - top_p: Optional[float] = None - top_k: Optional[int] = None - min_p: Optional[float] = None - use_beam_search: Optional[bool] = None - length_penalty: Optional[float] = None - seed: Optional[int] = None - - -class TRTLLMWorkerRequest(BaseModel): - token_ids: List[TokenIdType] - stop_conditions: StopConditions - sampling_options: SamplingOptions - eos_token_ids: List[TokenIdType] = Field(default_factory=list) - mdc_sum: Optional[str] = None - annotations: List[str] = Field(default_factory=list) - estimated_prefix_hit_num_blocks: Optional[int] = None - disaggregated_params: Optional[DisaggregatedParams] = Field(default=None) diff --git a/examples/tensorrt_llm_sd/components/frontend.py b/examples/tensorrt_llm_sd/components/frontend.py deleted file mode 100644 index 98be2dfa33..0000000000 --- a/examples/tensorrt_llm_sd/components/frontend.py +++ /dev/null @@ -1,119 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import logging -import subprocess -from pathlib import Path - -from components.worker import TensorRTLLMWorker -from fastapi import FastAPI -from pydantic import BaseModel - -from dynamo import sdk -from dynamo.sdk import depends, service -from dynamo.sdk.lib.config import ServiceConfig -from dynamo.sdk.lib.image import DYNAMO_IMAGE - -logger = logging.getLogger(__name__) - - -def get_dynamo_run_binary(): - """Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command.""" - sdk_path = Path(sdk.__file__) - binary_path = sdk_path.parent / "cli/bin/dynamo-run" - if not binary_path.exists(): - return "dynamo-run" - else: - return str(binary_path) - - -class FrontendConfig(BaseModel): - """Configuration for the Frontend service including model and HTTP server settings.""" - - served_model_name: str - endpoint: str - port: int = 8000 - router: str = "round-robin" - block_size: int = 32 - - -# todo this should be called ApiServer -@service( - dynamo={ - "namespace": "dynamo", - }, - workers=1, - image=DYNAMO_IMAGE, - app=FastAPI(title="TensorRT-LLM Example"), -) -class Frontend: - worker = depends(TensorRTLLMWorker) - - def __init__(self): - """Initialize Frontend service with HTTP server and model configuration.""" - self.frontend_config = FrontendConfig( - **ServiceConfig.get_parsed_config("Frontend") - ) - self.process = None - - logger.warning(f"Frontend config: {self.frontend_config}") - - self.start_ingress_and_processor() - - def start_ingress_and_processor(self): - """Starting dynamo-run based ingress and processor""" - logger.info( - f"Starting HTTP server and processor on port {self.frontend_config.port}" - ) - dynamo_run_binary = get_dynamo_run_binary() - - cmd = [ - dynamo_run_binary, - "in=http", - "out=dyn", - "--http-port", - str(self.frontend_config.port), - "--router-mode", - self.frontend_config.router, - ] - - logger.info(f"Frontend cmd: {cmd}") - - self.process = subprocess.Popen( - cmd, - stdout=None, - stderr=None, - ) - - def close(self): - """Clean up resources by terminating the subprocess.""" - if self.process is not None: - try: - logger.info("Terminating subprocess...") - self.process.terminate() - # Wait for process to terminate with a timeout - self.process.wait(timeout=5) - except subprocess.TimeoutExpired: - logger.warning("Subprocess did not terminate gracefully, forcing kill") - self.process.kill() - self.process.wait() - except Exception as e: - logger.error(f"Error while terminating subprocess: {e}") - finally: - self.process = None - - def __del__(self): - """Destructor to ensure subprocess is cleaned up.""" - self.close() diff --git a/examples/tensorrt_llm_sd/components/prefill_worker.py b/examples/tensorrt_llm_sd/components/prefill_worker.py deleted file mode 100644 index 7e43d1fca7..0000000000 --- a/examples/tensorrt_llm_sd/components/prefill_worker.py +++ /dev/null @@ -1,75 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & 
AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import logging - -from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine -from common.parser import parse_tensorrt_llm_args -from common.protocol import TRTLLMWorkerRequest - -from dynamo.sdk import async_on_start, dynamo_context, endpoint, on_shutdown, service -from dynamo.sdk.lib.config import ServiceConfig - -logger = logging.getLogger(__name__) - - -@service( - dynamo={ - "namespace": "dynamo", - }, - resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, - workers=1, -) -class TensorRTLLMPrefillWorker(BaseTensorrtLLMEngine): - def __init__(self): - logger.info("Initializing TensorRT-LLM Prefill Worker") - class_name = self.__class__.__name__ - config = ServiceConfig.get_instance() - config_args = config.as_args(class_name, prefix="") - args = parse_tensorrt_llm_args(config_args) - lease_id = dynamo_context["endpoints"][0].lease_id() - namespace, _ = TensorRTLLMPrefillWorker.dynamo_address() # type: ignore - - engine_config = BaseEngineConfig( - namespace=namespace, - component=class_name, - endpoint="generate", - model_path=args.model_path, - served_model_name=args.served_model_name, - kv_block_size=args.kv_block_size, - extra_engine_args=args.extra_engine_args, - publish_events_and_metrics=False, - disaggregation_mode="prefill", - remote_prefill_endpoint=None, - lease_id=lease_id, - ) - - super().__init__(config=engine_config) - - @async_on_start - async def async_init(self): - runtime = dynamo_context["runtime"] - await self.initialize(runtime) - logger.info("TensorRT-LLM Prefill Worker initialized") - - @on_shutdown - async def async_cleanup(self): - logger.info("Cleaning up TensorRT-LLM Prefill Worker") - await self.cleanup() - logger.info("TensorRT-LLM Prefill Worker cleanup completed") - - @endpoint() - async def generate(self, request: TRTLLMWorkerRequest): - async for response in super().generate(request): - yield response diff --git a/examples/tensorrt_llm_sd/components/worker.py b/examples/tensorrt_llm_sd/components/worker.py deleted file mode 100644 index 9074bfbe8d..0000000000 --- a/examples/tensorrt_llm_sd/components/worker.py +++ /dev/null @@ -1,115 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import logging - -from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine -from common.parser import parse_tensorrt_llm_args -from common.protocol import TRTLLMWorkerRequest -from components.prefill_worker import TensorRTLLMPrefillWorker - -from dynamo.llm import ModelType, register_llm -from dynamo.sdk import ( - async_on_start, - depends, - dynamo_context, - endpoint, - on_shutdown, - service, -) -from dynamo.sdk.lib.config import ServiceConfig - -logger = logging.getLogger(__name__) - - -@service( - dynamo={ - "namespace": "dynamo", - }, - resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, - workers=1, -) -class TensorRTLLMWorker(BaseTensorrtLLMEngine): - prefill_worker = depends(TensorRTLLMPrefillWorker) - - def __init__(self): - logger.info("Initializing TensorRT-LLM Worker") - class_name = self.__class__.__name__ - config = ServiceConfig.get_instance() - config_args = config.as_args(class_name, prefix="") - args = parse_tensorrt_llm_args(config_args) - lease_id = dynamo_context["endpoints"][0].lease_id() - namespace, _ = TensorRTLLMWorker.dynamo_address() # type: ignore - endpoint_name = "generate" - publish_events_and_metrics = args.router == "kv" - prefill_class_name = "TensorRTLLMPrefillWorker" - - if args.enable_disagg: - disaggregation_mode = "decode" - else: - disaggregation_mode = "prefill_and_decode" - - engine_config = BaseEngineConfig( - namespace=namespace, - component=class_name, - endpoint=endpoint_name, - model_path=args.model_path, - served_model_name=args.served_model_name, - kv_block_size=args.kv_block_size, - extra_engine_args=args.extra_engine_args, - publish_events_and_metrics=publish_events_and_metrics, - disaggregation_mode=disaggregation_mode, - remote_prefill_endpoint=f"dyn://{namespace}.{prefill_class_name}.generate", - lease_id=lease_id, - ) - - super().__init__(config=engine_config) - - @async_on_start - async def async_init(self): - runtime = dynamo_context["runtime"] - await self.initialize(runtime) - - logger.info("Registering LLM for discovery") - endpoint = ( - runtime.namespace(self._config.namespace) - .component(self._config.component) - .endpoint(self._config.endpoint) - ) - - try: - await register_llm( - ModelType.Backend, - endpoint, - self._config.model_path, - self._config.served_model_name, - kv_cache_block_size=self._config.kv_block_size, - ) - logger.info("Successfully registered LLM for discovery") - except Exception as e: - logger.error(f"Failed to register LLM for discovery: {e}") - raise - - logger.info("TensorRT-LLM Worker initialized") - - @on_shutdown - async def async_cleanup(self): - logger.info("Cleaning up TensorRT-LLM Worker") - await self.cleanup() - logger.info("TensorRT-LLM Worker cleanup completed") - - @endpoint() - async def generate(self, request: TRTLLMWorkerRequest): - async for response in super().generate(request): - yield response diff --git a/examples/tensorrt_llm_sd/configs/agg.yaml b/examples/tensorrt_llm_sd/configs/agg.yaml deleted file mode 100644 index a3d4594ed8..0000000000 --- a/examples/tensorrt_llm_sd/configs/agg.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/agg_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/agg_router.yaml b/examples/tensorrt_llm_sd/configs/agg_router.yaml deleted file mode 100644 index 58f2a82ab3..0000000000 --- a/examples/tensorrt_llm_sd/configs/agg_router.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: kv - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/agg_config.yaml" - router: kv - ServiceArgs: - workers: 1 - resources: - gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/disagg.yaml b/examples/tensorrt_llm_sd/configs/disagg.yaml deleted file mode 100644 index 73c202f4be..0000000000 --- a/examples/tensorrt_llm_sd/configs/disagg.yaml +++ /dev/null @@ -1,48 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - # Name to serve the model under - served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/decode_config.yaml" - enable-disagg: true - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 - -TensorRTLLMPrefillWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 - diff --git a/examples/tensorrt_llm_sd/configs/disagg_router.yaml b/examples/tensorrt_llm_sd/configs/disagg_router.yaml deleted file mode 100644 index faae7f65a3..0000000000 --- a/examples/tensorrt_llm_sd/configs/disagg_router.yaml +++ /dev/null @@ -1,47 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: kv - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/decode_config.yaml" - enable-disagg: true - router: kv - ServiceArgs: - workers: 1 - resources: - gpu: 1 - -TensorRTLLMPrefillWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. 
- # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml deleted file mode 100644 index 02b5cd8463..0000000000 --- a/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true - -kv_cache_config: - free_gpu_memory_fraction: 0.95 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -use_cuda_graph: true diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml deleted file mode 100644 index eb943fd6e7..0000000000 --- a/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -disable_overlap_scheduler: false -use_cuda_graph: true -kv_cache_config: - free_gpu_memory_fraction: 0.95 - diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml deleted file mode 100644 index 5dee9e653d..0000000000 --- a/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml +++ /dev/null @@ -1,28 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -# Overlap scheduler not currently supported in prefill only workers. -disable_overlap_scheduler: true -use_cuda_graph: false - -kv_cache_config: - free_gpu_memory_fraction: 0.95 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml deleted file mode 100644 index 3ac3facedd..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - extra-engine-args: "configs/llama4_eagle3/engine_configs/agg_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml deleted file mode 100644 index 9d96befbe5..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml +++ /dev/null @@ -1,49 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/DeepSeek-R1-FP4" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - extra-engine-args: "configs/deepseek_r1/engine_configs/decode_config.yaml" - enable-disagg: true - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 - -TensorRTLLMPrefillWorker: - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - extra-engine-args: "configs/deepseek_r1/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml deleted file mode 100644 index 29dddba56f..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml +++ /dev/null @@ -1,54 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. 
- # free_gpu_memory_fraction: 0.30 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -use_cuda_graph: true -cuda_graph_padding_enabled: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml deleted file mode 100644 index 772b94b283..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml +++ /dev/null @@ -1,55 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. - # free_gpu_memory_fraction: 0.30 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false -use_cuda_graph: true -cuda_graph_padding_enabled: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml deleted file mode 100644 index 6ae899a68a..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.75 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true -# NOTE: This dtype must match in both prefill/decode configs -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml deleted file mode 100644 index f0b5411221..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml +++ /dev/null @@ -1,50 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.30 - -# Enable the MTP(Multi-Token Prediction) in the model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml deleted file mode 100644 index ab48b2e78b..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml +++ /dev/null @@ -1,53 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: false -max_batch_size: 256 -# Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: -# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) -# This is a known issue in TensorRT-LLM and will be resolved in the next release. -max_num_tokens: 512 -# 8704 = 8192 ISL + 512 OSL -max_seq_len: 8704 -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -# Enable the MTP(Multi-Token Prediction) in decode model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml deleted file mode 100644 index ee6ee26a94..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. 
- -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 -kv_cache_config: - free_gpu_memory_fraction: 0.75 -print_iter_log: true -kv_cache_dtype: fp8 -disable_overlap_scheduler: true - -# Enable the MTP(Multi-Token Prediction) in the prefill model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml deleted file mode 100644 index 626ca27953..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml +++ /dev/null @@ -1,32 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml deleted file mode 100644 index 5fe2679809..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml +++ /dev/null @@ -1,52 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: "nvidia/DeepSeek-R1-FP4" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. 
- model-path: "nvidia/DeepSeek-R1-FP4" - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/decode_config.yaml" - router: round-robin - enable-disagg: true - ServiceArgs: - workers: 1 - resources: - gpu: 4 - -TensorRTLLMPrefillWorker: - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md deleted file mode 100644 index 342cd45129..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md +++ /dev/null @@ -1,275 +0,0 @@ - - -# Example: Multi-node TRTLLM Workers with Dynamo on Slurm - -To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16), -the set of nodes need to be launched together in the same MPI world, such as -via `mpirun` or `srun`. This is true regardless of whether the worker is -aggregated, prefill-only, or decode-only. - -In this document we will demonstrate two examples launching multinode workers -on a slurm cluster with `srun`: -1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16 - worker across 4 GB200 nodes -2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node - TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode - worker (4 nodes) across a total of 8 GB200 nodes. - -NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and -`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or -using `mpirun` directly, with relative ease. - -## Setup - -For simplicity of the example, we will make some assumptions about your slurm cluster: -1. First, we assume you have access to a slurm cluster with multiple GPU nodes - available. For functional testing, most setups should be fine. For performance - testing, you should aim to allocate groups of nodes that are performantly - inter-connected, such as those in an NVL72 setup. -2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis) - SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this - example will use `srun` arguments like `--container-image`, - `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis. - If your cluster supports similar container based plugins, you may be able to - modify the script to use that instead. -3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as - described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker). - This is the image that can be set to the `IMAGE` environment variable in later steps. -4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. 
We - will allocate 8 nodes below as a reference command to have enough capacity - to run both examples. If you plan to only run the aggregated example, you - will only need 4 nodes. If you customize the configurations to require a - different number of nodes, you can adjust the number of allocated nodes - accordingly. Pre-allocating nodes is technically not a requirement, - but it makes iterations of testing/experimenting easier. - - Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup: - ```bash - # Set partition manually based on your slurm cluster's partition names - PARTITION="" - # Set account manually if this command doesn't work on your cluster - ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" - salloc \ - --partition="${PARTITION}" \ - --account="${ACCOUNT}" \ - --job-name="${ACCOUNT}-dynamo.trtllm" \ - -t 05:00:00 \ - --nodes 8 - ``` -5. Lastly, we will assume you are inside an interactive shell on one of your allocated - nodes, which may be the default behavior after executing the `salloc` command above - depending on the cluster setup. If not, then you should SSH into one of the allocated nodes. - -### Environment Variable Setup - -This example aims to automate as much of the environment setup as possible, -but all slurm clusters and environments are different, and you may need to -dive into the scripts to make modifications based on your specific environment. - -Assuming you have already allocated your nodes via `salloc`, and are -inside an interactive shell on one of the allocated nodes, set the -following environment variables based on your environment: -```bash -# NOTE: IMAGE must be set manually for now -# To build an image, see the steps here: -# https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker -export IMAGE="" - -# MOUNTS are the host:container path pairs that are mounted into the containers -# launched by each `srun` command. -# -# If you want to reference files, such as $MODEL_PATH below, in a -# different location, you can customize MOUNTS or specify additional -# comma-separated mount pairs here. -# -# NOTE: Currently, this example assumes that the local bash scripts and configs -# referenced are mounted into /mnt inside the container. If you want to -# customize the location of the scripts, make sure to modify `srun_aggregated.sh` -# accordingly for the new locations of `start_frontend_services.sh` and -# `start_trtllm_worker.sh`. -# -# For example, assuming your cluster had a `/lustre` directory on the host, you -# could add that as a mount like so: -# -# export MOUNTS="${PWD}:/mnt,/lustre:/lustre" -export MOUNTS="${PWD}:/mnt" - -# NOTE: In general, DeepSeek R1 is very large, so it is recommended to -# pre-download the model weights and save them in some shared location, -# NFS storage, HF_CACHE, etc. and modify the `--model-path` below -# to reuse the pre-downloaded weights instead. -# -# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights: -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# -# On Hopper systems, FP4 isn't supported so you'll need to use the default weights: -# https://huggingface.co/deepseek-ai/DeepSeek-R1 -export MODEL_PATH="nvidia/DeepSeek-R1-FP4" - -# The name the model will be served/queried under, matching what's -# returned by the /v1/models endpoint. -# -# By default this is inferred from MODEL_PATH, but when using locally downloaded -# model weights, it can be nice to have explicit control over the name.
-export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" -``` - -## Aggregated WideEP - -Assuming you have at least 4 nodes allocated following the setup steps above, -follow these steps below to launch an **aggregated** deployment across 4 nodes: - -```bash -# Default set in srun_aggregated.sh, but can customize here. -# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml" - -# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG -# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of -# total GPUs necessary to satisfy the requested parallelism. For example, -# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16. -# export NUM_NODES=4 - -# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. -# export NUM_GPUS_PER_NODE=4 - -# Launches: -# - frontend + etcd/nats on current (head) node -# - one large aggregated trtllm worker across multiple nodes via MPI tasks -./srun_aggregated.sh -``` - -## Disaggregated WideEP - -Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) -following the setup above, follow these steps below to launch a **disaggregated** -deployment across 8 nodes: - -> [!Tip] -> Make sure you have a fresh environment and don't still have the aggregated -> example above still deployed on the same set of nodes. - -```bash -# Defaults set in srun_disaggregated.sh, but can customize here. -# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml" -# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml" - -# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG -# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG -# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and -# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of -# GPUs necessary to satisfy the requested parallelism in each config. -# export NUM_PREFILL_NODES=4 -# export NUM_DECODE_NODES=4 - -# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. -# export NUM_GPUS_PER_NODE=4 - -# Launches: -# - frontend + etcd/nats on current (head) node. -# - one large prefill trtllm worker across multiple nodes via MPI tasks -# - one large decode trtllm worker across multiple nodes via MPI tasks -./srun_disaggregated.sh -``` - -## Understanding the Output - -1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches - etcd, NATS, and the OpenAI frontend on the head node only - called "node1" in the example output below. The second launches - a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node - using 4 GPUs each. - ``` - # Frontend/etcd/nats services - srun: launching StepId=453374.17 on host node1, 1 tasks: 0 - ... - # TP16 TRTLLM worker split across 4 nodes with 4 gpus each - srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3] - srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7] - srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11] - srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15] - ``` -2. The OpenAI frontend will listen for and dynamically discover workers as - they register themselves with Dynamo's distributed runtime: - ``` - 0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models - 0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000" - ``` -3. 
-3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
-   GPU on each node, which will each output their progress while loading the model.
-   You can see each rank's output prefixed with the rank at the start of each log line
-   until the model successfully finishes loading:
-   ```
-   8: rank8 run mgmn worker node with mpi_world_size: 16 ...
-   10: rank10 run mgmn worker node with mpi_world_size: 16 ...
-   9: rank9 run mgmn worker node with mpi_world_size: 16 ...
-   11: rank11 run mgmn worker node with mpi_world_size: 16 ...
-   ...
-   15: Model init total -- 55.42s
-   11: Model init total -- 55.91s
-   12: Model init total -- 55.24s
-   ```
-4. After the model fully finishes loading on all ranks, the worker will register itself,
-   and the OpenAI frontend will detect it, signaled by this output:
-   ```
-   0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4"
-   ```
-5. At this point, with the worker fully initialized and detected by the frontend,
-   it is now ready for inference.
-6. `srun_disaggregated.sh` follows a very similar flow, but launches three srun
-   jobs instead of two: one for the frontend, one for the prefill worker, and one
-   for the decode worker.
-
-## Example Request
-
-To verify the deployed model is working, send a `curl` request:
-```bash
-# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
-HOST=localhost
-PORT=8000
-# "model" here should match the model name returned by the /v1/models endpoint
-curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "'${SERVED_MODEL_NAME}'",
-    "messages": [
-    {
-        "role": "user",
-        "content": "Tell me a story as if we were playing dungeons and dragons."
-    }
-    ],
-    "stream": true,
-    "max_tokens": 30
-}'
-```
-
-## Cleanup
-
-To clean up background `srun` processes launched by `srun_aggregated.sh` or
-`srun_disaggregated.sh`, you can run:
-```bash
-pkill srun
-```
-
-## Known Issues
-
-- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
-  FP4 weights. In theory, the example should work on alternative setups such as
-  H100 nodes with FP8 weights, but this hasn't been tested yet.
-- This example only tests an aggregated model setup for now. A disaggregated
-  serving example will be added in the near future.
-- WideEP configs in this directory are still being tested. A WideEP specific
-  example with documentation will be added once ready.
-- There are known issues where WideEP workers may not cleanly shut down:
-  - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
-    now, you must manually clean these up before deploying again on the
-    same set of nodes.
-  - Similarly, there may be GPU memory left in-use after killing the `srun`
-    jobs. After cleaning up any leftover shared memory files as described
-    above, the GPU memory should slowly be released. You can run `watch nvidia-smi`
-    to check on this behavior. If you don't free the GPU memory before the
-    next deployment, you may get a CUDA OOM error while loading the model.
-  - There is mention of this issue in the relevant TRT-LLM blog
-    [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
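As a convenience, the manual cleanup steps called out in the known issues above can be strung together into a small sketch. This is not one of the provided scripts; it only combines the `pkill srun`, `/dev/shm/moe_*`, and `nvidia-smi` checks already described, so treat it as a starting point and adapt it to your cluster.

```bash
#!/bin/bash
# Cleanup sketch before re-deploying on the same set of nodes.
# Assumes the WideEP shared-memory file pattern (/dev/shm/moe_*) noted above.

# Stop any background srun jobs launched by srun_aggregated.sh / srun_disaggregated.sh.
pkill srun || true

# Remove leftover WideEP shared-memory files, if any remain after shutdown.
rm -f /dev/shm/moe_*

# Watch GPU memory being released before the next deployment (Ctrl+C to exit).
watch nvidia-smi
```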
diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml deleted file mode 100644 index d697caacfa..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Example of a Multi-node worker, but no WideEP or EPLB. -# See wide_ep*.yaml for WideEP example configs. -backend: pytorch -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.7 -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml deleted file mode 100644 index f2fe0a13c6..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# moe_load_balancer settings for TRTLLM based on: -# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer -num_slots: 288 -layer_updates_per_iter: 2 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml deleted file mode 100644 index 5bbc66bd69..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml +++ /dev/null @@ -1,35 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -backend: pytorch - -# WideEP related settings -moe_backend: WideEP -# moe_max_num_tokens will default to max_num_tokens if left unspecified. -# -# If you want to set this value explicitly, one recommendation is below: -# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size -# 4096 = 256 * 16 -# moe_max_num_tokens: 4096 -moe_load_balancer: /mnt/engine_configs/eplb.yaml -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 - -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.7 -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml deleted file mode 100644 index ac7fc7e8f6..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml +++ /dev/null @@ -1,59 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# WideEP related settings -moe_backend: WideEP -moe_load_balancer: /mnt/engine_configs/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - # free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. - free_gpu_memory_fraction: 0.30 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false -use_cuda_graph: true -cuda_graph_padding_enabled: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml deleted file mode 100644 index 06968a3a76..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml +++ /dev/null @@ -1,41 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-backend: pytorch - -# WideEP related settings -moe_backend: WideEP -moe_load_balancer: /mnt/engine_configs/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.75 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true -# NOTE: This dtype must match in both prefill/decode configs -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh deleted file mode 100755 index 5a632551b9..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh +++ /dev/null @@ -1,75 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# This is one of the only variables that must be set currently, most of the rest may -# just work out of the box if following the steps in the README. -IMAGE="${IMAGE:-""}" - -# Set to mount current host directory to /mnt inside the container as an example, -# but you may freely customize the mounts based on your cluster. A common practice -# is to mount paths to NFS storage for common scripts, model weights, etc. -# NOTE: This can be a comma separated list of multiple mounts as well. -DEFAULT_MOUNT="${PWD}:/mnt" -MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" - -# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes. -# For 8xH100 nodes as an example, you may set this to 2 nodes x 8 gpus/node instead. -NUM_NODES=${NUM_NODES:-4} -NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} - -export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_agg.yaml}" - -# Automate settings of certain variables for convenience, but you are free -# to manually set these for more control as well. -ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" -export HEAD_NODE="${SLURMD_NODENAME}" -export HEAD_NODE_IP="$(hostname -i)" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" - -if [[ -z ${IMAGE} ]]; then - echo "ERROR: You need to set the IMAGE environment variable to the " \ - "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ - "See how to build one from source here: " \ - "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" - exit 1 -fi - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching frontend services in background." 
-srun \ - --overlap \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodelist "${HEAD_NODE}" \ - --nodes 1 \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_frontend_services.sh & - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching multi-node worker in background." -# No --task for the worker defaults to aggregated mode -TASK="" \ -srun \ - --mpi pmix \ - --oversubscribe \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodes "${NUM_NODES}" \ - --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh deleted file mode 100755 index 32cb4993a9..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh +++ /dev/null @@ -1,94 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# This is one of the only variables that must be set currently, most of the rest may -# just work out of the box if following the steps in the README. -IMAGE="${IMAGE:-""}" - -# Set to mount current host directory to /mnt inside the container as an example, -# but you may freely customize the mounts based on your cluster. A common practice -# is to mount paths to NFS storage for common scripts, model weights, etc. -# NOTE: This can be a comma separated list of multiple mounts as well. -DEFAULT_MOUNT="${PWD}:/mnt" -MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" - -NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} - -NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4} -PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_prefill.yaml}" - -NUM_DECODE_NODES=${NUM_DECODE_NODES:-4} -DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_decode.yaml}" - -# Automate settings of certain variables for convenience, but you are free -# to manually set these for more control as well. -ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" -export HEAD_NODE="${SLURMD_NODENAME}" -export HEAD_NODE_IP="$(hostname -i)" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" - -if [[ -z ${IMAGE} ]]; then - echo "ERROR: You need to set the IMAGE environment variable to the " \ - "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ - "See how to build one from source here: " \ - "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" - exit 1 -fi - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching frontend services in background." 
-srun \ - --overlap \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodelist "${HEAD_NODE}" \ - --nodes 1 \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_frontend_services.sh & - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching multi-node prefill worker in background." -TASK=prefill \ -ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \ -srun \ - --mpi pmix \ - --oversubscribe \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodes "${NUM_PREFILL_NODES}" \ - --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_trtllm_worker.sh & - -echo "Launching multi-node decode worker in background." -TASK=decode \ -ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \ -srun \ - --mpi pmix \ - --oversubscribe \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodes "${NUM_DECODE_NODES}" \ - --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh deleted file mode 100755 index 0d1b588904..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh +++ /dev/null @@ -1,16 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# Start NATS -nats-server -js & - -# Start etcd -etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & - -# Wait for NATS/etcd to startup -sleep 3 - -# Start OpenAI Frontend which will dynamically discover workers when they startup -# NOTE: This is a blocking call. -dynamo-run in=http out=dyn --http-port 8000 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh deleted file mode 100755 index 257b3b1127..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh +++ /dev/null @@ -1,46 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -if [[ -z ${MODEL_PATH} ]]; then - echo "ERROR: MODEL_PATH was not set." - echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \ - "downloaded path to the model weights. Since Deepseek R1 is large, it is " \ - "recommended to pre-download them to a shared location and provide the path." - exit 1 -fi - -if [[ -z ${SERVED_MODEL_NAME} ]]; then - echo "WARNING: SERVED_MODEL_NAME was not set. It will be derived from MODEL_PATH." -fi - - - -if [[ -z ${ENGINE_CONFIG} ]]; then - echo "ERROR: ENGINE_CONFIG was not set." 
- echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file." - exit 1 -fi - -EXTRA_ARGS="" -if [[ -n ${TASK} ]]; then - EXTRA_ARGS+="--task ${TASK}" -fi - -# NOTE: When this script is run directly from srun, the environment variables -# for TRTLLM KV cache are not set. So we need to set them here. -# Related issue: https://github.com/ai-dynamo/dynamo/issues/1743 -if [[ -z ${TRTLLM_USE_UCX_KVCACHE} ]] && [[ -z ${TRTLLM_USE_NIXL_KVCACHE} ]]; then - export TRTLLM_USE_UCX_KVCACHE=1 -fi - -# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM -# worker and registers itself with the runtime. It is currently easier to wrap -# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes, -# but this may be refactored into 'dynamo serve' in the future. -trtllm-llmapi-launch \ - python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \ - --model-path "${MODEL_PATH}" \ - --model-name "${SERVED_MODEL_NAME}" \ - --extra-engine-args "${ENGINE_CONFIG}" \ - ${EXTRA_ARGS} diff --git a/examples/tensorrt_llm_sd/graphs/agg.py b/examples/tensorrt_llm_sd/graphs/agg.py deleted file mode 100644 index e79f5f315c..0000000000 --- a/examples/tensorrt_llm_sd/graphs/agg.py +++ /dev/null @@ -1,19 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from components.frontend import Frontend -from components.worker import TensorRTLLMWorker - -Frontend.link(TensorRTLLMWorker) diff --git a/examples/tensorrt_llm_sd/graphs/disagg.py b/examples/tensorrt_llm_sd/graphs/disagg.py deleted file mode 100644 index 58bde05d9a..0000000000 --- a/examples/tensorrt_llm_sd/graphs/disagg.py +++ /dev/null @@ -1,20 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from components.frontend import Frontend -from components.prefill_worker import TensorRTLLMPrefillWorker -from components.worker import TensorRTLLMWorker - -Frontend.link(TensorRTLLMWorker).link(TensorRTLLMPrefillWorker) From 07a822c29bb7b61068bb1e784c8c423798a91684 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Wed, 9 Jul 2025 10:19:43 -0700 Subject: [PATCH 07/20] Standardizing names, Fixing File paths --- .../configs/llama4/eagle/{mtp_agg.yaml => eagle_agg.yaml} | 2 +- .../llama4/eagle/{mtp_disagg.yaml => eagle_disagg.yaml} | 4 ++-- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) rename examples/tensorrt_llm/configs/llama4/eagle/{mtp_agg.yaml => eagle_agg.yaml} (93%) rename examples/tensorrt_llm/configs/llama4/eagle/{mtp_disagg.yaml => eagle_disagg.yaml} (98%) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml similarity index 93% rename from examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml rename to examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index 6a64336101..9ea7e3c265 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -23,7 +23,7 @@ Frontend: TensorRTLLMWorker: served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - extra-engine-args: "configs/llama4/engine_configs/agg_config.yaml" + extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: workers: 1 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml similarity index 98% rename from examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml rename to examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index 72d3ce6f29..a04a9114c2 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -34,7 +34,7 @@ TensorRTLLMWorker: ServiceArgs: workers: 1 resources: - gpu: 4 + gpu: 8 TensorRTLLMPrefillWorker: # NOTE: FP4 only supported starting with Blackwell GPUs. 
@@ -49,4 +49,4 @@ TensorRTLLMPrefillWorker: ServiceArgs: workers: 1 resources: - gpu: 4 + gpu: 8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 633d630633..053bad7e0f 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -20,14 +20,14 @@ backend: pytorch tensor_parallel_size: 4 -moe_expert_parallel_size: 4 +moe_expert_parallel_size: 1 # enable_attention_dp: true max_batch_size: 256 # 8448 = 8192 ISL + 256 OSL max_num_tokens: 8448 max_seq_len: 8448 kv_cache_config: - free_gpu_memory_fraction: 0.30 + free_gpu_memory_fraction: 0.25 # Enable the MTP(Multi-Token Prediction) in the model engine speculative_config: From 070962d5f6213e17f996ad72bbd7568bf06b765a Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Wed, 9 Jul 2025 13:52:52 -0700 Subject: [PATCH 08/20] Correcting Model Name and Decoding Type --- .../tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml | 6 +++--- .../configs/llama4/eagle/eagle_disagg.yaml | 12 ++++-------- .../llama4/eagle/engine_configs/agg_config.yaml | 2 +- .../llama4/eagle/engine_configs/decode_config.yaml | 2 +- .../llama4/eagle/engine_configs/prefill_config.yaml | 2 +- 5 files changed, 10 insertions(+), 14 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index 9ea7e3c265..bcf97cfa64 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -15,14 +15,14 @@ Frontend: # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + model-path: "meta-llama/Llama-4-Maverick-17B-128E" extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index a04a9114c2..429acd1d12 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -14,18 +14,14 @@ # limitations under the License. Frontend: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + model-path: "meta-llama/Llama-4-Maverick-17B-128E" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. 
# The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml" @@ -41,7 +37,7 @@ TensorRTLLMPrefillWorker: # https://huggingface.co/nvidia/DeepSeek-R1-FP4 # You can also specify the full path to locally downloaded weights # instead of a HuggingFace ID here. - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "meta-llama/Llama-4-Maverick-17B-128E" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml" diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 053bad7e0f..d6528c891c 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -31,7 +31,7 @@ kv_cache_config: # Enable the MTP(Multi-Token Prediction) in the model engine speculative_config: - decoding_type: MTP + decoding_type: Eagle num_nextn_predict_layers: 1 use_cuda_graph: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index fed64bcb22..16bca4c893 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -34,7 +34,7 @@ kv_cache_config: # Enable the MTP(Multi-Token Prediction) in decode model engine speculative_config: - decoding_type: MTP + decoding_type: Eagle num_nextn_predict_layers: 1 use_cuda_graph: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index 6dd4bca5ed..6286876f45 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -33,5 +33,5 @@ disable_overlap_scheduler: true # Enable the MTP(Multi-Token Prediction) in the prefill model engine speculative_config: - decoding_type: MTP + decoding_type: Eagle num_nextn_predict_layers: 1 From d02beeb0764c24fa86b643fb1949cb00fab9ecd4 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Wed, 9 Jul 2025 23:53:12 -0700 Subject: [PATCH 09/20] Adding Speculative Decoding/KV Config Fields --- .../llama4/eagle/engine_configs/agg_config.yaml | 14 +++++++++----- .../llama4/eagle/engine_configs/decode_config.yaml | 13 ++++++++----- .../eagle/engine_configs/prefill_config.yaml | 12 +++++++----- 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index d6528c891c..e3f48e8076 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -21,18 +21,22 @@ backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 1 -# enable_attention_dp: true max_batch_size: 256 # 8448 = 8192 ISL + 256 OSL max_num_tokens: 8448 max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.25 -# 
Enable the MTP(Multi-Token Prediction) in the model engine +# Enable Speculative Decoding in the model engine speculative_config: decoding_type: Eagle - num_nextn_predict_layers: 1 + max_draft_len: 1 + pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + +kv_cache_config: + free_gpu_memory_fraction: 0.5 + enable_block_reuse: false + +disable_overlap_scheduler: true use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index 16bca4c893..f0dfe4555f 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -21,7 +21,6 @@ backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 -# enable_attention_dp: false max_batch_size: 256 # Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: # max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) @@ -29,13 +28,17 @@ max_batch_size: 256 max_num_tokens: 512 # 8704 = 8192 ISL + 512 OSL max_seq_len: 8704 -kv_cache_config: - free_gpu_memory_fraction: 0.85 +disable_overlap_scheduler: true -# Enable the MTP(Multi-Token Prediction) in decode model engine +# Enable Speculative Decoding in the model engine speculative_config: decoding_type: Eagle - num_nextn_predict_layers: 1 + max_draft_len: 1 + pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + +kv_cache_config: + free_gpu_memory_fraction: 0.5 + enable_block_reuse: false use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index 6286876f45..76e2ee26fc 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -21,17 +21,19 @@ backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 -# enable_attention_dp: true max_batch_size: 1 max_num_tokens: 8192 max_seq_len: 8192 -kv_cache_config: - free_gpu_memory_fraction: 0.75 print_iter_log: true kv_cache_dtype: fp8 disable_overlap_scheduler: true -# Enable the MTP(Multi-Token Prediction) in the prefill model engine +# Enable Speculative Decoding in the model engine speculative_config: decoding_type: Eagle - num_nextn_predict_layers: 1 + max_draft_len: 1 + pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + +kv_cache_config: + free_gpu_memory_fraction: 0.5 + enable_block_reuse: false From f0a570766d504dccad123bbe23c0dab32f133492 Mon Sep 17 00:00:00 2001 From: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Date: Thu, 10 Jul 2025 09:04:21 -0700 Subject: [PATCH 10/20] Adding eagle3_one_model key-value Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> --- .../configs/llama4/eagle/engine_configs/prefill_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index 76e2ee26fc..a356fc0a37 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml 
+++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -33,6 +33,7 @@ speculative_config: decoding_type: Eagle max_draft_len: 1 pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + eagle3_one_model: False kv_cache_config: free_gpu_memory_fraction: 0.5 From 48b7786fc4ab145346a1dc38f04f8d2797438082 Mon Sep 17 00:00:00 2001 From: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Date: Thu, 10 Jul 2025 09:04:32 -0700 Subject: [PATCH 11/20] Adding eagle3_one_model key-value Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> --- .../configs/llama4/eagle/engine_configs/decode_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index f0dfe4555f..bf57dde642 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -35,6 +35,7 @@ speculative_config: decoding_type: Eagle max_draft_len: 1 pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + eagle3_one_model: False kv_cache_config: free_gpu_memory_fraction: 0.5 From 142e0eafc9aa98bd9d110fa0a1a8e3d7dbf718f9 Mon Sep 17 00:00:00 2001 From: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Date: Thu, 10 Jul 2025 09:04:43 -0700 Subject: [PATCH 12/20] Adding eagle3_one_model key-value Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> --- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index e3f48e8076..65af4b7ac3 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -31,6 +31,7 @@ speculative_config: decoding_type: Eagle max_draft_len: 1 pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + eagle3_one_model: False kv_cache_config: free_gpu_memory_fraction: 0.5 From 0985d105c9727b1101aad9c6aebabce03a3b332c Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:45:30 -0700 Subject: [PATCH 13/20] Update Config --- .../configs/llama4/eagle/eagle_agg.yaml | 6 +++--- .../configs/llama4/eagle/eagle_disagg.yaml | 12 ++++------- .../eagle/engine_configs/agg_config.yaml | 21 +++++++------------ .../eagle/engine_configs/decode_config.yaml | 11 ++-------- .../eagle/engine_configs/prefill_config.yaml | 8 ++----- 5 files changed, 19 insertions(+), 39 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index bcf97cfa64..91e1112008 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -15,14 +15,14 @@ Frontend: # This is the client-facing model name, you can set this to anything you'd like. 
- served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" - model-path: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index 429acd1d12..c255bcf8f5 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -14,14 +14,14 @@ # limitations under the License. Frontend: - served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" - model-path: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml" @@ -33,11 +33,7 @@ TensorRTLLMWorker: gpu: 8 TensorRTLLMPrefillWorker: - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "meta-llama/Llama-4-Maverick-17B-128E" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml" diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 65af4b7ac3..83bf12286e 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -13,18 +13,15 @@ # See the License for the specific language governing permissions and # limitations under the License. -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 1 +tensor_parallel_size: 8 +moe_expert_parallel_size: 4 max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 +# When max_num_tokens set to higher values, can cause OOM issues. +# Will be investigated in the future with TRTLLM team. 
+max_num_tokens: 1024 +max_seq_len: 1024 +autotuner_enabled: false # Enable Speculative Decoding in the model engine speculative_config: @@ -35,9 +32,7 @@ speculative_config: kv_cache_config: free_gpu_memory_fraction: 0.5 - enable_block_reuse: false - -disable_overlap_scheduler: true + enable_block_reuse: false use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index bf57dde642..4b595d2126 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -13,22 +13,15 @@ # See the License for the specific language governing permissions and # limitations under the License. -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 max_batch_size: 256 -# Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: -# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) -# This is a known issue in TensorRT-LLM and will be resolved in the next release. max_num_tokens: 512 # 8704 = 8192 ISL + 512 OSL max_seq_len: 8704 disable_overlap_scheduler: true +autotuner_enabled: false # Enable Speculative Decoding in the model engine speculative_config: @@ -39,7 +32,7 @@ speculative_config: kv_cache_config: free_gpu_memory_fraction: 0.5 - enable_block_reuse: false + enable_block_reuse: false use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index a356fc0a37..8442e478ba 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -13,11 +13,6 @@ # See the License for the specific language governing permissions and # limitations under the License. -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. 
- backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 @@ -27,6 +22,7 @@ max_seq_len: 8192 print_iter_log: true kv_cache_dtype: fp8 disable_overlap_scheduler: true +autotuner_enabled: false # Enable Speculative Decoding in the model engine speculative_config: @@ -37,4 +33,4 @@ speculative_config: kv_cache_config: free_gpu_memory_fraction: 0.5 - enable_block_reuse: false + enable_block_reuse: false From 8bd09ad7e07df1c7125f782534c3b8b20377c36e Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:50:25 -0700 Subject: [PATCH 14/20] Update GPU/TP Count --- examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml | 2 +- examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml | 4 ++-- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index 91e1112008..fe4a94df4b 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -28,4 +28,4 @@ TensorRTLLMWorker: ServiceArgs: workers: 1 resources: - gpu: 8 + gpu: 4 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index c255bcf8f5..3bfe111fac 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -30,7 +30,7 @@ TensorRTLLMWorker: ServiceArgs: workers: 1 resources: - gpu: 8 + gpu: 4 TensorRTLLMPrefillWorker: model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" @@ -41,4 +41,4 @@ TensorRTLLMPrefillWorker: ServiceArgs: workers: 1 resources: - gpu: 8 + gpu: 4 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 83bf12286e..caa3c9ea3c 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -14,7 +14,7 @@ # limitations under the License. backend: pytorch -tensor_parallel_size: 8 +tensor_parallel_size: 4 moe_expert_parallel_size: 4 max_batch_size: 256 # When max_num_tokens set to higher values, can cause OOM issues. 
From a4ed3f77afe68f6e03cae4089107a830c868f721 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:53:00 -0700 Subject: [PATCH 15/20] Adding back disable_overlap_scheduler key-value --- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index caa3c9ea3c..6cc1305a24 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -22,6 +22,7 @@ max_batch_size: 256 max_num_tokens: 1024 max_seq_len: 1024 autotuner_enabled: false +disable_overlap_scheduler: true # Enable Speculative Decoding in the model engine speculative_config: From d054d530d2f8a4846f91856d1ec500c26a11dc27 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:56:02 -0700 Subject: [PATCH 16/20] Updating max_seq_len --- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 6cc1305a24..1bed25ef27 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -20,7 +20,7 @@ max_batch_size: 256 # When max_num_tokens set to higher values, can cause OOM issues. # Will be investigated in the future with TRTLLM team. max_num_tokens: 1024 -max_seq_len: 1024 +max_seq_len: 8448 autotuner_enabled: false disable_overlap_scheduler: true From c5fc5c8a884130f1fd688d9202938cdf98f59be5 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 18:46:08 -0700 Subject: [PATCH 17/20] Updating README.md --- examples/tensorrt_llm/README.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index f844a56d94..7e8e086a39 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -350,3 +350,25 @@ unset TRTLLM_USE_NIXL_KVCACHE export TRTLLM_USE_UCX_KVCACHE=1 ``` + +### Example architectures for Llama 4 Maverick Instruct + Eagle Speculative Decoding + +#### Notes +* The current example has been tested out on a GB200x4 node. +* To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: + * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) + * It includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) + +##### Aggregated Serving +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml +``` +* Known Issue: In Aggregated Serving, when the `max_num_tokens` was set to higher values, in our case 8448, we experienced Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. 
+ + +##### Disaggregated Serving +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml +``` \ No newline at end of file From 25ba4f150e512b07fbb7763c271c4231aa244661 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 18:48:35 -0700 Subject: [PATCH 18/20] Fix wording --- examples/tensorrt_llm/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index 7e8e086a39..b447eef45f 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -357,7 +357,7 @@ export TRTLLM_USE_UCX_KVCACHE=1 * The current example has been tested out on a GB200x4 node. * To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) - * It includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) + * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) ##### Aggregated Serving ```bash From 1de9595b10948ff9bfb895707b26fb38202398f5 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 19:01:34 -0700 Subject: [PATCH 19/20] Adding to README.md --- examples/tensorrt_llm/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index b447eef45f..d94f312ca7 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -358,14 +358,14 @@ export TRTLLM_USE_UCX_KVCACHE=1 * To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) +* If you need to download model weights off huggingface, make sure you run the command `huggingface-cli login` and have access to the necessary gated models. ##### Aggregated Serving ```bash cd /workspace/examples/tensorrt_llm dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml ``` -* Known Issue: In Aggregated Serving, when the `max_num_tokens` was set to higher values, in our case 8448, we experienced Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. - +* Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. ##### Disaggregated Serving ```bash From fabab466564e47fd25ed76cd04bd33e1bc034c71 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Mon, 14 Jul 2025 10:26:23 -0700 Subject: [PATCH 20/20] Updates to disaggregate workflow --- examples/tensorrt_llm/README.md | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index d94f312ca7..b6169efd54 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -354,7 +354,9 @@ export TRTLLM_USE_UCX_KVCACHE=1 ### Example architectures for Llama 4 Maverick Instruct + Eagle Speculative Decoding #### Notes -* The current example has been tested out on a GB200x4 node. 
+* Testing for the current example used: + * One GB200x4 node for aggregate serving + * Two GB200x4 nodes for disaggregate serving * To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) @@ -368,7 +370,31 @@ dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml * Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. ##### Disaggregated Serving + +###### Head Node +Start nats/etcd +``` bash +nats-server -js & +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & +``` + +Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: + +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.agg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml & +``` + +###### Worker Node(s) +Set environment variables pointing at the etcd/nats endpoints on the head node. +```bash +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +Deploy a Prefill worker: ```bash cd /workspace/examples/tensorrt_llm -dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml +dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/llama4/eagle/eagle_disagg.yaml --service-name TensorRTLLMPrefillWorker & ``` \ No newline at end of file
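
Once the prefill worker has registered, a quick sanity check from the head node confirms the disaggregated deployment is serving requests. The snippet below is only a sketch: it assumes the frontend is listening on port 8000 as configured above and reuses the served model name from `eagle_disagg.yaml`.

```bash
# List models registered with the frontend; the served model should appear
# once both the decode and prefill workers have started.
curl -s localhost:8000/v1/models

# Send a small test request.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```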