From 9caa48ac1cd7c68c8ecf28d7823b52fbc4fa7119 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Mon, 7 Jul 2025 11:06:18 -0700 Subject: [PATCH 01/20] basic readme.md --- examples/tensorrt_llm_sd/README.md | 352 +++++++++++++++++++++++++++++ 1 file changed, 352 insertions(+) create mode 100644 examples/tensorrt_llm_sd/README.md diff --git a/examples/tensorrt_llm_sd/README.md b/examples/tensorrt_llm_sd/README.md new file mode 100644 index 0000000000..f844a56d94 --- /dev/null +++ b/examples/tensorrt_llm_sd/README.md @@ -0,0 +1,352 @@ + + +# LLM Deployment Examples using TensorRT-LLM + +This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM. + +## Use the Latest Release + +We recommend using the latest stable release of dynamo to avoid breaking changes: + +[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) + +You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: + +```bash +git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) +``` + +## Deployment Architectures + +See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. +Note that this TensorRT-LLM version does not support all the options yet. + +Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving. + +## Getting Started + +1. Choose a deployment architecture based on your requirements +2. Configure the components as needed +3. Deploy using the provided scripts + +### Prerequisites + +Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml) +```bash +docker compose -f deploy/metrics/docker-compose.yml up -d +``` + +### Build docker + +```bash +# TensorRT-LLM uses git-lfs, which needs to be installed in advance. +apt-get update && apt-get -y install git git-lfs + +# On an x86 machine: +./container/build.sh --framework tensorrtllm + +# On an ARM machine: +./container/build.sh --framework tensorrtllm --platform linux/arm64 + +# Build the container with the default experimental TensorRT-LLM commit +# WARNING: This is for experimental feature testing only. +# The container should not be used in a production environment. +./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit +``` + +### Run container + +``` +./container/run.sh --framework tensorrtllm -it +``` +## Run Deployment + +This figure shows an overview of the major components to deploy: + + + +``` + ++------+ +-----------+ +------------------+ +---------------+ +| HTTP |----->| processor |----->| Worker |------------>| Prefill | +| |<-----| |<-----| |<------------| Worker | ++------+ +-----------+ +------------------+ +---------------+ + | ^ | + query best | | return | publish kv events + worker | | worker_id v + | | +------------------+ + | +---------| kv-router | + +------------->| | + +------------------+ + +``` + +Note: The above architecture illustrates all the components. The final components +that get spawned depend upon the chosen graph. 
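+
+Once one of the deployments below is up, you can sanity-check it with a single
+OpenAI-compatible request to the Frontend (see the [client](#client) section for details).
+The route, port, and model name below are illustrative and assume the defaults used in
+`configs/agg.yaml`; adjust them to match your configuration:
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+    "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
+    "max_tokens": 64,
+    "stream": false
+  }'
+```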
+
+### Example architectures
+
+#### Aggregated serving
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
+```
+
+#### Aggregated serving with KV Routing
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.agg:Frontend -f ./configs/agg_router.yaml
+```
+
+#### Disaggregated serving
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
+```
+
+#### Disaggregated serving with KV Routing
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.disagg:Frontend -f ./configs/disagg_router.yaml
+```
+
+#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1
+```bash
+cd /workspace/examples/tensorrt_llm
+dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
+```
+
+Notes:
+- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add `--use-default-experimental-tensorrtllm-commit` to the arguments of the `build.sh` script.
+
+  Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit`
+
+- There is noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
+- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
+
+#### Multi-Node Disaggregated Serving
+
+In the following example, we will demonstrate how to run a Disaggregated Serving
+deployment across multiple nodes. For simplicity, we will deploy a single Decode
+worker on one node and a single Prefill worker on another node.
+However, the instance counts, TP sizes, other configs, and responsibilities of each node
+can be customized and deployed in similar ways.
+
+For example, to deploy Deepseek R1, you could replace the referenced example
+configs (`configs/agg.yaml`, `configs/disagg.yaml`) with the corresponding Deepseek R1
+example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`).
+You can find the example Deepseek R1 configs for GB200
+[here](configs/deepseek_r1), but the config settings can be customized for testing
+other hardware configurations or parallelism strategies.
+
+This "multi-node" example demonstrates how to connect dynamo workers across
+different nodes, but for simplicity, each worker individually fits on a single node.
+For details on how to launch a worker that spans multiple nodes due to sheer model
+size, or for features like large-scale expert parallelism, see the
+[multinode worker example](configs/deepseek_r1/multinode).
+
+##### Head Node
+
+Start nats/etcd:
+```bash
+# NATS data persisted to /tmp/nats/jetstream by default
+nats-server -js &
+
+# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
+
+# NOTE: Clearing out the etcd and nats jetstream data directories across runs
+# helps guarantee clean and reproducible results.
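+
+# (Optional) Sanity check that etcd and NATS are reachable before launching any
+# workers. These commands are illustrative; adjust hosts/ports if you changed
+# the defaults above.
+curl -s http://localhost:2379/health
+bash -c 'exec 3<>/dev/tcp/localhost/4222' && echo "NATS port 4222 is open"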
+``` + +Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: + +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml & +``` + +Notes: +- The aggregated graph (`graphs.agg`) is chosen here because it also describes + our desired deployment settings for the head node: launching the utility components + (Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with + `remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker` + independently on a separate node in the next step of this demonstration. + You are free to customize the graph and configuration of components launched on + each node. +- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a + single source of truth to be used for deployments on all of our nodes, describing + the configurations for all of our components, including both decode and prefill + workers, but can be customized based on your deployment needs. + +##### Worker Node(s) + +Set environment variables pointing at the etcd/nats endpoints on the head node +so the Dynamo Distributed Runtime can orchestrate communication and +discoverability between the head node and worker nodes: +```bash +# if not head node +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +Deploy a Prefill worker: +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker & +``` + +Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node! + +##### Additional Notes for Multi-Node Deployments + +Notes: +- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`, + and change the config to one that includes the router, such as `configs/disagg_router.yaml` +- This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes. + Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step + remains mostly the same. The primary difference between aggregation and disaggregation for this step is + whether or not the `TensorRTLLMWorker` is configured to do `remote-prefill` or not in the config file + (ex: `configs/disagg.yaml` vs `configs/agg.yaml`). +- To apply the same concept for launching additional decode workers on worker nodes, you can + directly start them, similar to the prefill worker step above: + ```bash + # Example: deploy decode worker only + cd /workspace/examples/tensorrt_llm + dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker & + ``` +- If you see an error about MPI Spawn failing during TRTLLM Worker initialziation on a Slurm-based cluster, + try unsetting the following environment variables before launching the TRTLLM worker. If you intend to + run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may + want to save these values into temporary variables and then restore them afterwards. 
+ ```bash + # Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes` + unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST + ``` + +#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1 + +Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations + +##### Head Node + +Start nats/etcd +```bash +nats-server -js & +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & +``` + +Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: + +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml & +``` + +##### Worker Node(s) + +Set environment variables pointing at the etcd/nats endpoints on the head node. +```bash +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +Deploy a Prefill worker: +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker & +``` + +Notes: +- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script. + + Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit` +- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. +- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. + + +### Client + +See [client](../llm/README.md#client) section to learn how to send request to the deployment. + +NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component. + +### Close deployment + +See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment. + +### Benchmarking + +To benchmark your deployment with GenAI-Perf, see this utility script, configuring the +`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh) + + +### KV Cache Transfer for Disaggregated Serving + +In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer: + +#### Default Method: UCX +By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers. + +#### Experimental Method: NIXL +TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments. + +**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet. 
+
+#### Using NIXL for KV Cache Transfer
+
+**Note:** The NIXL backend for TensorRT-LLM is currently only supported on the AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer.
+
+To enable NIXL for KV cache transfer in disaggregated serving:
+
+1. **Build the container with NIXL support:**
+   The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.
+
+   **Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):**
+   ```bash
+   rm -rf /tmp/trtllm_wheel
+   ```
+
+   **Build the container with NIXL support:**
+   ```bash
+   ./container/build.sh --framework tensorrtllm \
+     --use-default-experimental-tensorrtllm-commit \
+     --trtllm-use-nixl-kvcache-experimental
+   ```
+
+   **Note:** Both the `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.
+
+2. **Run the containerized environment:**
+   See the [run container](#run-container) section to learn how to start the container image built in the previous step.
+
+3. **Start the disaggregated service:**
+   See [disaggregated serving](#disaggregated-serving) to learn how to start the deployment.
+
+4. **Send the request:**
+   See the [client](#client) section to learn how to send a request to the deployment.
+
+**Important:** Ensure that etcd and NATS services are running before starting the service.
+
+The container automatically sets the appropriate environment variable (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can also be used with UCX for KV cache transfer by setting:
+```bash +unset TRTLLM_USE_NIXL_KVCACHE +export TRTLLM_USE_UCX_KVCACHE=1 +``` + From bf4c6bbbbe750d874ec60b2b62690bf4d1c71a89 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Mon, 7 Jul 2025 15:02:51 -0700 Subject: [PATCH 02/20] Copied over previous tutorial --- examples/tensorrt_llm_sd/__init__.py | 14 + examples/tensorrt_llm_sd/common/__init__.py | 0 .../tensorrt_llm_sd/common/base_engine.py | 389 ++++++++++++++++++ examples/tensorrt_llm_sd/common/parser.py | 62 +++ examples/tensorrt_llm_sd/common/protocol.py | 104 +++++ .../tensorrt_llm_sd/components/frontend.py | 119 ++++++ .../components/prefill_worker.py | 75 ++++ examples/tensorrt_llm_sd/components/worker.py | 115 ++++++ examples/tensorrt_llm_sd/configs/agg.yaml | 34 ++ .../tensorrt_llm_sd/configs/agg_router.yaml | 34 ++ .../configs/deepseek_r1/agg.yaml | 35 ++ .../configs/deepseek_r1/disagg.yaml | 49 +++ .../engine_configs/agg_config.yaml | 54 +++ .../engine_configs/decode_config.yaml | 55 +++ .../engine_configs/prefill_config.yaml | 37 ++ .../mtp/engine_configs/agg_config.yaml | 50 +++ .../mtp/engine_configs/decode_config.yaml | 53 +++ .../mtp/engine_configs/prefill_config.yaml | 37 ++ .../configs/deepseek_r1/mtp/mtp_agg.yaml | 36 ++ .../configs/deepseek_r1/mtp/mtp_disagg.yaml | 52 +++ .../configs/deepseek_r1/multinode/README.md | 275 +++++++++++++ .../multinode/engine_configs/dep16_agg.yaml | 27 ++ .../multinode/engine_configs/eplb.yaml | 7 + .../multinode/engine_configs/wide_ep_agg.yaml | 35 ++ .../engine_configs/wide_ep_decode.yaml | 59 +++ .../engine_configs/wide_ep_prefill.yaml | 41 ++ .../deepseek_r1/multinode/srun_aggregated.sh | 75 ++++ .../multinode/srun_disaggregated.sh | 94 +++++ .../multinode/start_frontend_services.sh | 16 + .../multinode/start_trtllm_worker.sh | 46 +++ examples/tensorrt_llm_sd/configs/disagg.yaml | 48 +++ .../configs/disagg_router.yaml | 47 +++ .../configs/engine_configs/agg_config.yaml | 31 ++ .../configs/engine_configs/decode_config.yaml | 27 ++ .../engine_configs/prefill_config.yaml | 28 ++ examples/tensorrt_llm_sd/graphs/agg.py | 19 + examples/tensorrt_llm_sd/graphs/disagg.py | 20 + 37 files changed, 2299 insertions(+) create mode 100644 examples/tensorrt_llm_sd/__init__.py create mode 100644 examples/tensorrt_llm_sd/common/__init__.py create mode 100644 examples/tensorrt_llm_sd/common/base_engine.py create mode 100644 examples/tensorrt_llm_sd/common/parser.py create mode 100644 examples/tensorrt_llm_sd/common/protocol.py create mode 100644 examples/tensorrt_llm_sd/components/frontend.py create mode 100644 examples/tensorrt_llm_sd/components/prefill_worker.py create mode 100644 examples/tensorrt_llm_sd/components/worker.py create mode 100644 examples/tensorrt_llm_sd/configs/agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/agg_router.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml create 
mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml create mode 100644 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh create mode 100755 examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh create mode 100644 examples/tensorrt_llm_sd/configs/disagg.yaml create mode 100644 examples/tensorrt_llm_sd/configs/disagg_router.yaml create mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml create mode 100644 examples/tensorrt_llm_sd/graphs/agg.py create mode 100644 examples/tensorrt_llm_sd/graphs/disagg.py diff --git a/examples/tensorrt_llm_sd/__init__.py b/examples/tensorrt_llm_sd/__init__.py new file mode 100644 index 0000000000..3159bfe656 --- /dev/null +++ b/examples/tensorrt_llm_sd/__init__.py @@ -0,0 +1,14 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/examples/tensorrt_llm_sd/common/__init__.py b/examples/tensorrt_llm_sd/common/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/tensorrt_llm_sd/common/base_engine.py b/examples/tensorrt_llm_sd/common/base_engine.py new file mode 100644 index 0000000000..3df95b490c --- /dev/null +++ b/examples/tensorrt_llm_sd/common/base_engine.py @@ -0,0 +1,389 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +from dataclasses import dataclass +from typing import Any, Optional + +from common.protocol import DisaggregatedTypeConverter, TRTLLMWorkerRequest +from tensorrt_llm import SamplingParams +from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_options +from tensorrt_llm.llmapi.tokenizer import tokenizer_factory +from tensorrt_llm.serve.openai_protocol import ( + DisaggregatedParams as OAIDisaggregatedParams, +) + +from dynamo.llm import get_tensorrtllm_engine, get_tensorrtllm_publisher +from dynamo.runtime import DistributedRuntime + +logger = logging.getLogger(__name__) + +logger.setLevel(logging.DEBUG) + +# Default buffer size for kv cache events. +DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024 + + +def parse_endpoint(endpoint: str) -> tuple[str, str, str]: + endpoint_str = endpoint.replace("dyn://", "", 1) + endpoint_parts = endpoint_str.split(".") + if len(endpoint_parts) != 3: + raise ValueError( + f"Invalid endpoint format: '{endpoint}'. " + "Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'." + ) + + return (endpoint_parts[0], endpoint_parts[1], endpoint_parts[2]) + + +@dataclass +class BaseEngineConfig: + """Base engine configuration""" + + namespace: str + component: str + endpoint: str + model_path: str + served_model_name: Optional[str] = None + kv_block_size: int = 32 + extra_engine_args: str = "" + publish_events_and_metrics: bool = False + disaggregation_mode: str = "prefill_and_decode" + remote_prefill_endpoint: Optional[str] = None + lease_id: int = 0 + + def __str__(self) -> str: + return ( + f"Config(namespace={self.namespace}, " + f"component={self.component}, " + f"endpoint={self.endpoint}, " + f"model_path={self.model_path}, " + f"served_model_name={self.served_model_name}, " + f"kv_block_size={self.kv_block_size}, " + f"extra_engine_args={self.extra_engine_args}, " + f"publish_events_and_metrics={self.publish_events_and_metrics}, " + f"disaggregation_mode={self.disaggregation_mode}, " + f"remote_prefill_endpoint={self.remote_prefill_endpoint}, " + f"lease_id={self.lease_id})" + ) + + +class BaseTensorrtLLMEngine: + def __init__( + self, + config: BaseEngineConfig, + ): + self._config = config + self._prefill_client = None + self._llm_engine = None + self._llm_engine_context = None + self._llm_publisher = None + self._llm_publisher_context = None + self._runtime = None + self._first_generation = True + # Initialize default sampling params + self.default_sampling_params = SamplingParams() + + async def initialize(self, runtime: DistributedRuntime): + """Initialize the engine and prefill client if needed""" + self._runtime = runtime + + # Convert model path to Path object if it's a local path, otherwise keep as string + model_path = str(self._config.model_path) + + # Initialize the LLM engine + engine_args: dict[str, Any] = { + "model": model_path, + "tensor_parallel_size": 1, + "backend": "pytorch", + "skip_tokenizer_init": True, + } + + if self._config.extra_engine_args: + # TODO: Support extra engine args from json file as well. 
+ engine_args = update_llm_args_with_extra_options( + engine_args, self._config.extra_engine_args + ) + # Update the model path in the config to the model path used by the engine. + self._config.model_path = str(engine_args["model"]) + if not self._config.model_path: + raise ValueError( + "Model specification is required. Present neither in the config nor in the extra engine args." + ) + + # Populate default sampling params from the model + tokenizer = tokenizer_factory(self._config.model_path) + self.default_sampling_params = SamplingParams() + self.default_sampling_params._setup(tokenizer) + self.default_sampling_params.stop = None + + if self._config.publish_events_and_metrics: + # 'event_buffer_max_size' is required to enable TRTLLM to publish kv cache events. + kv_cache_config: dict[str, Any] | Any = None + if "kv_cache_config" not in engine_args: + kv_cache_config = {} + kv_cache_config[ + "event_buffer_max_size" + ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE + else: + kv_cache_config = engine_args["kv_cache_config"] + if ( + hasattr(kv_cache_config, "event_buffer_max_size") + and not kv_cache_config.event_buffer_max_size + ): + kv_cache_config.event_buffer_max_size = ( + DEFAULT_KV_EVENT_BUFFER_MAX_SIZE + ) + elif ( + isinstance(kv_cache_config, dict) + and "event_buffer_max_size" not in kv_cache_config + ): + kv_cache_config[ + "event_buffer_max_size" + ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE + engine_args["kv_cache_config"] = kv_cache_config + + # Enable iter perf stats by default if we are publishing events and metrics. + if not engine_args.get("enable_iter_perf_stats"): + engine_args["enable_iter_perf_stats"] = True + + # Only pytorch backend is supported for now to publish events and metrics. + if engine_args.get("backend") != "pytorch": + logging.error( + "Only pytorch backend is supported for now to publish events and metrics." + ) + raise RuntimeError( + "Only pytorch backend is supported for now to publish events and metrics. Hence, KV router is not supported." 
+ ) + + logging.info(f"TRTLLM engine args: {engine_args}") + + # Get the engine using the asynccontextmanager + self._llm_engine_context = get_tensorrtllm_engine(engine_args) + if self._llm_engine_context is not None: + self._llm_engine = await self._llm_engine_context.__aenter__() + else: + raise RuntimeError("Failed to create LLM engine context") + + if ( + self._config.publish_events_and_metrics + and self._config.disaggregation_mode != "prefill" + ): + kv_listener = runtime.namespace(self._config.namespace).component( + self._config.component + ) + self._llm_publisher_context = get_tensorrtllm_publisher( + kv_listener, + self._llm_engine, + kv_listener, + self._config.lease_id, + self._config.kv_block_size, + ) + if self._llm_publisher_context is not None: + self._llm_publisher = await self._llm_publisher_context.__aenter__() + else: + raise RuntimeError("Failed to create LLM publisher context") + + # Initialize prefill client if in decode mode + if self._config.disaggregation_mode == "decode": + if self._config.remote_prefill_endpoint is None: + raise ValueError("remote_prefill_endpoint is required for decode mode") + logging.info( + f"Initializing remote prefill client for endpoint: {self._config.remote_prefill_endpoint}" + ) + ( + parsed_namespace, + parsed_component_name, + parsed_endpoint_name, + ) = parse_endpoint(self._config.remote_prefill_endpoint) + if self._runtime is not None: + self._prefill_client = ( + await self._runtime.namespace(parsed_namespace) + .component(parsed_component_name) + .endpoint(parsed_endpoint_name) + .client() + ) + else: + raise RuntimeError("Runtime not initialized") + + async def cleanup(self): + """Cleanup resources""" + if self._llm_publisher_context: + try: + await self._llm_publisher_context.__aexit__(None, None, None) + except Exception as e: + logging.error(f"Error during publisher cleanup: {e}") + finally: + self._llm_publisher = None + self._llm_publisher_context = None + + if self._llm_engine_context: + try: + await self._llm_engine_context.__aexit__(None, None, None) + except Exception as e: + logging.error(f"Error during engine cleanup: {e}") + finally: + self._llm_engine = None + self._llm_engine_context = None + + self._prefill_client = None + + async def remote_prefill(self, request: TRTLLMWorkerRequest): + """ + Send a prefill request to the remote prefill worker. + + Args: + request: The original request to be sent for prefill + + Returns: + The response from the remote prefill worker + + Raises: + ValueError: If prefill client is not initialized or multiple responses received + """ + prefill_request = request.model_copy(deep=True) + # TRTLLM requires max_tokens to be set for prefill requests. + prefill_request.stop_conditions.max_tokens = 1 + prefill_request.disaggregated_params = OAIDisaggregatedParams( + request_type="context_only" + ) + + if self._prefill_client is None: + raise ValueError("Prefill client not initialized") + try: + # TODO: Use smart KV router to determine which prefill worker to use. This would also require supporting publishing events for prefill workers. + remote_prefill_responses = [ + remote_prefill_response + async for remote_prefill_response in await self._prefill_client.round_robin( + prefill_request.model_dump_json() + ) + ] + except Exception as e: + raise ValueError(f"Error in remote prefill: {e}") + + if len(remote_prefill_responses) > 1: + raise ValueError( + "Prefill worker returned more than one response. This is currently not supported in remote prefill mode." 
+ ) + + if len(remote_prefill_responses) == 0: + raise ValueError("No response received from remote prefill worker") + + remote_prefill_response = remote_prefill_responses[0] + return remote_prefill_response + + async def generate(self, request: TRTLLMWorkerRequest): + if self._llm_engine is None: + raise RuntimeError("Engine not initialized") + + if self._llm_publisher: + publishers_error = self._llm_publisher.check_error_queue() + if publishers_error: + raise publishers_error + + inputs = request.token_ids + + # Decode the disaggregated params from the request + disaggregated_params = DisaggregatedTypeConverter.to_llm_disaggregated_params( + request.disaggregated_params + ) + num_output_tokens_so_far = 0 + + if self._config.disaggregation_mode == "decode": + # Run prefill/context phase remotely if disaggregation mode is decode. + try: + prefill_result = await self.remote_prefill(request) + except Exception as e: + raise ValueError(f"Error in remote prefill: {e}") + + remote_prefill_response = prefill_result.data() + if ( + remote_prefill_response["finish_reason"] == "stop" + or remote_prefill_response["finish_reason"] == "error" + ): + yield remote_prefill_response + return + num_output_tokens_so_far = len(remote_prefill_response["token_ids"]) + + # Decode the disaggregated params from the remote prefill response + # Decode the disaggregated params from the remote prefill response + disaggregated_params = ( + DisaggregatedTypeConverter.to_llm_disaggregated_params( + OAIDisaggregatedParams( + **remote_prefill_response["disaggregated_params"] + ) + ) + ) + + # Send the first token response to the client + first_token_response = remote_prefill_response + first_token_response.pop("disaggregated_params") + yield first_token_response + + # Set the disaggregated params to generation_only for the rest of the generation + disaggregated_params.request_type = "generation_only" + + sampling_params = self.default_sampling_params + for key, value in request.sampling_options.model_dump().items(): + if not value: + continue + if hasattr(sampling_params, key): + setattr(sampling_params, key, value) + + max_tokens = request.stop_conditions.max_tokens + if max_tokens: + sampling_params.max_tokens = max_tokens + + ignore_eos = request.stop_conditions.ignore_eos + if ignore_eos: + sampling_params.ignore_eos = ignore_eos + + # TODO: Disable streaming for context only requests when adding disagg support + async for res in self._llm_engine.llm.generate_async( + inputs=inputs, + sampling_params=sampling_params, + disaggregated_params=disaggregated_params, + streaming=(self._config.disaggregation_mode != "prefill"), + ): + # TRTLLM engine needs to start generating tokens first before stats + # can be retrieved. + if self._first_generation and self._llm_publisher: + self._llm_publisher.start() + self._first_generation = False + + if res.finished and self._config.disaggregation_mode != "prefill": + yield {"finish_reason": "stop", "token_ids": []} + break + + if not res.outputs: + yield {"finish_reason": "error", "token_ids": []} + break + + output = res.outputs[0] + next_total_toks = len(output.token_ids) + out = {"token_ids": output.token_ids[num_output_tokens_so_far:]} + if output.finish_reason: + out["finish_reason"] = output.finish_reason + if output.stop_reason: + out["stop_reason"] = output.stop_reason + if self._config.disaggregation_mode == "prefill": + # Return the disaggregated params only when operating in prefill mode. 
+ out[ + "disaggregated_params" + ] = DisaggregatedTypeConverter.to_oai_disaggregated_params( + output.disaggregated_params + ).model_dump() + + yield out + num_output_tokens_so_far = next_total_toks diff --git a/examples/tensorrt_llm_sd/common/parser.py b/examples/tensorrt_llm_sd/common/parser.py new file mode 100644 index 0000000000..67bb230796 --- /dev/null +++ b/examples/tensorrt_llm_sd/common/parser.py @@ -0,0 +1,62 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_tensorrt_llm_args( + config_args, +) -> argparse.Namespace: + parser = argparse.ArgumentParser(description="A TensorRT-LLM Worker parser") + parser.add_argument( + "--extra-engine-args", + type=str, + default="", + help="Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.", + ) + parser.add_argument( + "--model-path", + type=str, + default=None, + help="Path to disk model or HuggingFace model identifier to load.", + ) + parser.add_argument( + "--served_model_name", + type=str, + help="Name to serve the model under.", + ) + parser.add_argument( + "--router", + type=str, + choices=["random", "round-robin", "kv"], + default="random", + help="Router type to use for scheduling requests to workers", + ) + + parser.add_argument( + "--kv-block-size", + type=int, + default=32, + help="Number of tokens per KV block in TRTLLM worker. Default is 32 for pytorch backend.", + ) + + parser.add_argument( + "--enable-disagg", + action="store_true", + help="Enable remote prefill for the worker", + ) + + args = parser.parse_args(config_args) + return args diff --git a/examples/tensorrt_llm_sd/common/protocol.py b/examples/tensorrt_llm_sd/common/protocol.py new file mode 100644 index 0000000000..f05cdb9f8f --- /dev/null +++ b/examples/tensorrt_llm_sd/common/protocol.py @@ -0,0 +1,104 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import base64 +from typing import List, Optional + +from pydantic import BaseModel, Field +from tensorrt_llm.llmapi import DisaggregatedParams as LlmDisaggregatedParams +from tensorrt_llm.serve.openai_protocol import DisaggregatedParams + + +class Tokens(BaseModel): + tokens: list[int] + + +TokenIdType = int + + +class DisaggregatedTypeConverter: + @staticmethod + def to_llm_disaggregated_params( + disaggregated_params: DisaggregatedParams, + ) -> LlmDisaggregatedParams: + if disaggregated_params is None: + return None + else: + opaque_state = ( + base64.b64decode(disaggregated_params.encoded_opaque_state) + if disaggregated_params.encoded_opaque_state is not None + else None + ) + + return LlmDisaggregatedParams( + request_type=disaggregated_params.request_type, + first_gen_tokens=disaggregated_params.first_gen_tokens, + ctx_request_id=disaggregated_params.ctx_request_id, + opaque_state=opaque_state, + ) + + @staticmethod + def to_oai_disaggregated_params( + tllm_disagg_params: LlmDisaggregatedParams, + ) -> DisaggregatedParams: + if tllm_disagg_params is None: + return None + else: + encoded_opaque_state = ( + base64.b64encode(tllm_disagg_params.opaque_state).decode("utf-8") + if tllm_disagg_params.opaque_state is not None + else None + ) + return DisaggregatedParams( + request_type=tllm_disagg_params.request_type, + first_gen_tokens=tllm_disagg_params.first_gen_tokens, + ctx_request_id=tllm_disagg_params.ctx_request_id, + encoded_opaque_state=encoded_opaque_state, + ) + + +# TODO: move these to common for all LLMs once we adopt dynamo-run +# derived from lib/llm/src/protocols/common/preprocessor.rs +class StopConditions(BaseModel): + max_tokens: Optional[int] = None + stop: Optional[List[str]] = None + stop_token_ids_hidden: Optional[List[TokenIdType]] = None + min_tokens: Optional[int] = None + ignore_eos: Optional[bool] = None + + +class SamplingOptions(BaseModel): + n: Optional[int] = None + best_of: Optional[int] = None + presence_penalty: Optional[float] = None + frequency_penalty: Optional[float] = None + repetition_penalty: Optional[float] = None + temperature: Optional[float] = None + top_p: Optional[float] = None + top_k: Optional[int] = None + min_p: Optional[float] = None + use_beam_search: Optional[bool] = None + length_penalty: Optional[float] = None + seed: Optional[int] = None + + +class TRTLLMWorkerRequest(BaseModel): + token_ids: List[TokenIdType] + stop_conditions: StopConditions + sampling_options: SamplingOptions + eos_token_ids: List[TokenIdType] = Field(default_factory=list) + mdc_sum: Optional[str] = None + annotations: List[str] = Field(default_factory=list) + estimated_prefix_hit_num_blocks: Optional[int] = None + disaggregated_params: Optional[DisaggregatedParams] = Field(default=None) diff --git a/examples/tensorrt_llm_sd/components/frontend.py b/examples/tensorrt_llm_sd/components/frontend.py new file mode 100644 index 0000000000..98be2dfa33 --- /dev/null +++ b/examples/tensorrt_llm_sd/components/frontend.py @@ -0,0 +1,119 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import subprocess +from pathlib import Path + +from components.worker import TensorRTLLMWorker +from fastapi import FastAPI +from pydantic import BaseModel + +from dynamo import sdk +from dynamo.sdk import depends, service +from dynamo.sdk.lib.config import ServiceConfig +from dynamo.sdk.lib.image import DYNAMO_IMAGE + +logger = logging.getLogger(__name__) + + +def get_dynamo_run_binary(): + """Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command.""" + sdk_path = Path(sdk.__file__) + binary_path = sdk_path.parent / "cli/bin/dynamo-run" + if not binary_path.exists(): + return "dynamo-run" + else: + return str(binary_path) + + +class FrontendConfig(BaseModel): + """Configuration for the Frontend service including model and HTTP server settings.""" + + served_model_name: str + endpoint: str + port: int = 8000 + router: str = "round-robin" + block_size: int = 32 + + +# todo this should be called ApiServer +@service( + dynamo={ + "namespace": "dynamo", + }, + workers=1, + image=DYNAMO_IMAGE, + app=FastAPI(title="TensorRT-LLM Example"), +) +class Frontend: + worker = depends(TensorRTLLMWorker) + + def __init__(self): + """Initialize Frontend service with HTTP server and model configuration.""" + self.frontend_config = FrontendConfig( + **ServiceConfig.get_parsed_config("Frontend") + ) + self.process = None + + logger.warning(f"Frontend config: {self.frontend_config}") + + self.start_ingress_and_processor() + + def start_ingress_and_processor(self): + """Starting dynamo-run based ingress and processor""" + logger.info( + f"Starting HTTP server and processor on port {self.frontend_config.port}" + ) + dynamo_run_binary = get_dynamo_run_binary() + + cmd = [ + dynamo_run_binary, + "in=http", + "out=dyn", + "--http-port", + str(self.frontend_config.port), + "--router-mode", + self.frontend_config.router, + ] + + logger.info(f"Frontend cmd: {cmd}") + + self.process = subprocess.Popen( + cmd, + stdout=None, + stderr=None, + ) + + def close(self): + """Clean up resources by terminating the subprocess.""" + if self.process is not None: + try: + logger.info("Terminating subprocess...") + self.process.terminate() + # Wait for process to terminate with a timeout + self.process.wait(timeout=5) + except subprocess.TimeoutExpired: + logger.warning("Subprocess did not terminate gracefully, forcing kill") + self.process.kill() + self.process.wait() + except Exception as e: + logger.error(f"Error while terminating subprocess: {e}") + finally: + self.process = None + + def __del__(self): + """Destructor to ensure subprocess is cleaned up.""" + self.close() diff --git a/examples/tensorrt_llm_sd/components/prefill_worker.py b/examples/tensorrt_llm_sd/components/prefill_worker.py new file mode 100644 index 0000000000..7e43d1fca7 --- /dev/null +++ b/examples/tensorrt_llm_sd/components/prefill_worker.py @@ -0,0 +1,75 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging + +from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine +from common.parser import parse_tensorrt_llm_args +from common.protocol import TRTLLMWorkerRequest + +from dynamo.sdk import async_on_start, dynamo_context, endpoint, on_shutdown, service +from dynamo.sdk.lib.config import ServiceConfig + +logger = logging.getLogger(__name__) + + +@service( + dynamo={ + "namespace": "dynamo", + }, + resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, + workers=1, +) +class TensorRTLLMPrefillWorker(BaseTensorrtLLMEngine): + def __init__(self): + logger.info("Initializing TensorRT-LLM Prefill Worker") + class_name = self.__class__.__name__ + config = ServiceConfig.get_instance() + config_args = config.as_args(class_name, prefix="") + args = parse_tensorrt_llm_args(config_args) + lease_id = dynamo_context["endpoints"][0].lease_id() + namespace, _ = TensorRTLLMPrefillWorker.dynamo_address() # type: ignore + + engine_config = BaseEngineConfig( + namespace=namespace, + component=class_name, + endpoint="generate", + model_path=args.model_path, + served_model_name=args.served_model_name, + kv_block_size=args.kv_block_size, + extra_engine_args=args.extra_engine_args, + publish_events_and_metrics=False, + disaggregation_mode="prefill", + remote_prefill_endpoint=None, + lease_id=lease_id, + ) + + super().__init__(config=engine_config) + + @async_on_start + async def async_init(self): + runtime = dynamo_context["runtime"] + await self.initialize(runtime) + logger.info("TensorRT-LLM Prefill Worker initialized") + + @on_shutdown + async def async_cleanup(self): + logger.info("Cleaning up TensorRT-LLM Prefill Worker") + await self.cleanup() + logger.info("TensorRT-LLM Prefill Worker cleanup completed") + + @endpoint() + async def generate(self, request: TRTLLMWorkerRequest): + async for response in super().generate(request): + yield response diff --git a/examples/tensorrt_llm_sd/components/worker.py b/examples/tensorrt_llm_sd/components/worker.py new file mode 100644 index 0000000000..9074bfbe8d --- /dev/null +++ b/examples/tensorrt_llm_sd/components/worker.py @@ -0,0 +1,115 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging + +from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine +from common.parser import parse_tensorrt_llm_args +from common.protocol import TRTLLMWorkerRequest +from components.prefill_worker import TensorRTLLMPrefillWorker + +from dynamo.llm import ModelType, register_llm +from dynamo.sdk import ( + async_on_start, + depends, + dynamo_context, + endpoint, + on_shutdown, + service, +) +from dynamo.sdk.lib.config import ServiceConfig + +logger = logging.getLogger(__name__) + + +@service( + dynamo={ + "namespace": "dynamo", + }, + resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, + workers=1, +) +class TensorRTLLMWorker(BaseTensorrtLLMEngine): + prefill_worker = depends(TensorRTLLMPrefillWorker) + + def __init__(self): + logger.info("Initializing TensorRT-LLM Worker") + class_name = self.__class__.__name__ + config = ServiceConfig.get_instance() + config_args = config.as_args(class_name, prefix="") + args = parse_tensorrt_llm_args(config_args) + lease_id = dynamo_context["endpoints"][0].lease_id() + namespace, _ = TensorRTLLMWorker.dynamo_address() # type: ignore + endpoint_name = "generate" + publish_events_and_metrics = args.router == "kv" + prefill_class_name = "TensorRTLLMPrefillWorker" + + if args.enable_disagg: + disaggregation_mode = "decode" + else: + disaggregation_mode = "prefill_and_decode" + + engine_config = BaseEngineConfig( + namespace=namespace, + component=class_name, + endpoint=endpoint_name, + model_path=args.model_path, + served_model_name=args.served_model_name, + kv_block_size=args.kv_block_size, + extra_engine_args=args.extra_engine_args, + publish_events_and_metrics=publish_events_and_metrics, + disaggregation_mode=disaggregation_mode, + remote_prefill_endpoint=f"dyn://{namespace}.{prefill_class_name}.generate", + lease_id=lease_id, + ) + + super().__init__(config=engine_config) + + @async_on_start + async def async_init(self): + runtime = dynamo_context["runtime"] + await self.initialize(runtime) + + logger.info("Registering LLM for discovery") + endpoint = ( + runtime.namespace(self._config.namespace) + .component(self._config.component) + .endpoint(self._config.endpoint) + ) + + try: + await register_llm( + ModelType.Backend, + endpoint, + self._config.model_path, + self._config.served_model_name, + kv_cache_block_size=self._config.kv_block_size, + ) + logger.info("Successfully registered LLM for discovery") + except Exception as e: + logger.error(f"Failed to register LLM for discovery: {e}") + raise + + logger.info("TensorRT-LLM Worker initialized") + + @on_shutdown + async def async_cleanup(self): + logger.info("Cleaning up TensorRT-LLM Worker") + await self.cleanup() + logger.info("TensorRT-LLM Worker cleanup completed") + + @endpoint() + async def generate(self, request: TRTLLMWorkerRequest): + async for response in super().generate(request): + yield response diff --git a/examples/tensorrt_llm_sd/configs/agg.yaml b/examples/tensorrt_llm_sd/configs/agg.yaml new file mode 100644 index 0000000000..a3d4594ed8 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/agg.yaml @@ -0,0 +1,34 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/agg_router.yaml b/examples/tensorrt_llm_sd/configs/agg_router.yaml new file mode 100644 index 0000000000..58f2a82ab3 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/agg_router.yaml @@ -0,0 +1,34 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: kv + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/agg_config.yaml" + router: kv + ServiceArgs: + workers: 1 + resources: + gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml new file mode 100644 index 0000000000..f7cec35e7d --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml @@ -0,0 +1,35 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + # This is the client-facing model name, you can set this to anything you'd like. + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + extra-engine-args: "configs/deepseek_r1/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml new file mode 100644 index 0000000000..9d96befbe5 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml @@ -0,0 +1,49 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + # This is the client-facing model name, you can set this to anything you'd like. + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + extra-engine-args: "configs/deepseek_r1/engine_configs/decode_config.yaml" + enable-disagg: true + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 + +TensorRTLLMPrefillWorker: + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. 
+ model-path: "nvidia/DeepSeek-R1-FP4" + extra-engine-args: "configs/deepseek_r1/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..29dddba56f --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml @@ -0,0 +1,54 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# TP/EP/PP/DP +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +pipeline_parallel_size: 1 +enable_attention_dp: false + +max_batch_size: 256 +# 8448 = 8192 ISL + 256 OSL +max_num_tokens: 8448 +max_seq_len: 8448 + +kv_cache_config: + # With dp attention disabled: high free_gpu_memory_fraction is fine. + free_gpu_memory_fraction: 0.85 + # With dp attention enabled: large ISL at high concurrency may need + # free_gpu_memory_fraction low to have enough available memory. + # free_gpu_memory_fraction: 0.30 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +use_cuda_graph: true +cuda_graph_padding_enabled: true +# NOTE: For larger max batch size, you may want to add larger cuda graph +# batch sizes below to match. +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..772b94b283 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml @@ -0,0 +1,55 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+backend: pytorch + +# TP/EP/PP/DP +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +pipeline_parallel_size: 1 +enable_attention_dp: false + +max_batch_size: 256 +max_num_tokens: 256 +# 8448 = 8192 ISL + 256 OSL +max_seq_len: 8448 + +kv_cache_config: + # With dp attention disabled: high free_gpu_memory_fraction is fine. + free_gpu_memory_fraction: 0.85 + # With dp attention enabled: large ISL at high concurrency may need + # free_gpu_memory_fraction low to have enough available memory. + # free_gpu_memory_fraction: 0.30 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: false +use_cuda_graph: true +cuda_graph_padding_enabled: true +# NOTE: For larger max batch size, you may want to add larger cuda graph +# batch sizes below to match. +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..6ae899a68a --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml @@ -0,0 +1,37 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# TP/EP/PP/DP +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +pipeline_parallel_size: 1 +enable_attention_dp: true + +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 + +kv_cache_config: + free_gpu_memory_fraction: 0.75 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: true +print_iter_log: true +# NOTE: This dtype must match in both prefill/decode configs +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..f0b5411221 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml @@ -0,0 +1,50 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +enable_attention_dp: true +max_batch_size: 256 +# 8448 = 8192 ISL + 256 OSL +max_num_tokens: 8448 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.30 + +# Enable MTP (Multi-Token Prediction) in the model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..ab48b2e78b --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml @@ -0,0 +1,53 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +enable_attention_dp: false +max_batch_size: 256 +# Note: When MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: +# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) +# This is a known issue in TensorRT-LLM and will be resolved in the next release.
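+# For example, with the values used below (max(cuda_graph_batch_sizes) = 256 and
+# num_nextn_predict_layers = 1), max_num_tokens must be at least 256 * (1 + 1) = 512.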
+max_num_tokens: 512 +# 8704 = 8192 ISL + 512 OSL +max_seq_len: 8704 +kv_cache_config: + free_gpu_memory_fraction: 0.85 + +# Enable the MTP(Multi-Token Prediction) in decode model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..ee6ee26a94 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml @@ -0,0 +1,37 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +enable_attention_dp: true +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 +kv_cache_config: + free_gpu_memory_fraction: 0.75 +print_iter_log: true +kv_cache_dtype: fp8 +disable_overlap_scheduler: true + +# Enable the MTP(Multi-Token Prediction) in the prefill model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml new file mode 100644 index 0000000000..c51abf9d95 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml @@ -0,0 +1,36 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. 
+ model-path: "nvidia/DeepSeek-R1-FP4" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml new file mode 100644 index 0000000000..5fe2679809 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml @@ -0,0 +1,52 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: "nvidia/DeepSeek-R1-FP4" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/DeepSeek-R1-FP4" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/decode_config.yaml" + router: round-robin + enable-disagg: true + ServiceArgs: + workers: 1 + resources: + gpu: 4 + +TensorRTLLMPrefillWorker: + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/DeepSeek-R1-FP4" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md new file mode 100644 index 0000000000..342cd45129 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md @@ -0,0 +1,275 @@ + + +# Example: Multi-node TRTLLM Workers with Dynamo on Slurm + +To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16), +the set of nodes need to be launched together in the same MPI world, such as +via `mpirun` or `srun`. This is true regardless of whether the worker is +aggregated, prefill-only, or decode-only. 
+ +In this document we will demonstrate two examples launching multinode workers +on a slurm cluster with `srun`: +1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16 + worker across 4 GB200 nodes +2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node + TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode + worker (4 nodes) across a total of 8 GB200 nodes. + +NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and +`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or +using `mpirun` directly, with relative ease. + +## Setup + +For simplicity of the example, we will make some assumptions about your slurm cluster: +1. First, we assume you have access to a slurm cluster with multiple GPU nodes + available. For functional testing, most setups should be fine. For performance + testing, you should aim to allocate groups of nodes that are performantly + inter-connected, such as those in an NVL72 setup. +2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis) + SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this + example will use `srun` arguments like `--container-image`, + `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis. + If your cluster supports similar container based plugins, you may be able to + modify the script to use that instead. +3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as + described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker). + This is the image that can be set to the `IMAGE` environment variable in later steps. +4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We + will allocate 8 nodes below as a reference command to have enough capacity + to run both examples. If you plan to only run the aggregated example, you + will only need 4 nodes. If you customize the configurations to require a + different number of nodes, you can adjust the number of allocated nodes + accordingly. Pre-allocating nodes is technically not a requirement, + but it makes iterations of testing/experimenting easier. + + Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup: + ```bash + # Set partition manually based on your slurm cluster's partition names + PARTITION="" + # Set account manually if this command doesn't work on your cluster + ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" + salloc \ + --partition="${PARTITION}" \ + --account="${ACCOUNT}" \ + --job-name="${ACCOUNT}-dynamo.trtllm" \ + -t 05:00:00 \ + --nodes 8 + ``` +5. Lastly, we will assume you are inside an interactive shell on one of your allocated + nodes, which may be the default behavior after executing the `salloc` command above + depending on the cluster setup. If not, then you should SSH into one of the allocated nodes. + +### Environment Variable Setup + +This example aims to automate as much of the environment setup as possible, +but all slurm clusters and environments are different, and you may need to +dive into the scripts to make modifications based on your specific environment. 
+ +Assuming you have already allocated your nodes via `salloc`, and are +inside an interactive shell on one of the allocated nodes, set the +following environment variables based on your setup: +```bash +# NOTE: IMAGE must be set manually for now +# To build an image, see the steps here: +# https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker +export IMAGE="" + +# MOUNTS are the host:container path pairs that are mounted into the containers +# launched by each `srun` command. +# +# If you want to reference files, such as $MODEL_PATH below, in a +# different location, you can customize MOUNTS or specify additional +# comma-separated mount pairs here. +# +# NOTE: Currently, this example assumes that the local bash scripts and configs +# referenced are mounted into /mnt inside the container. If you want to +# customize the location of the scripts, make sure to modify `srun_aggregated.sh` +# accordingly for the new locations of `start_frontend_services.sh` and +# `start_trtllm_worker.sh`. +# +# For example, assuming your cluster had a `/lustre` directory on the host, you +# could add that as a mount like so: +# +# export MOUNTS="${PWD}:/mnt,/lustre:/lustre" +export MOUNTS="${PWD}:/mnt" + +# NOTE: In general, Deepseek R1 is very large, so it is recommended to +# pre-download the model weights and save them in some shared location, +# NFS storage, HF_CACHE, etc. and modify the `--model-path` below +# to reuse the pre-downloaded weights instead. +# +# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights: +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# +# On Hopper systems, FP4 isn't supported so you'll need to use the default weights: +# https://huggingface.co/deepseek-ai/DeepSeek-R1 +export MODEL_PATH="nvidia/DeepSeek-R1-FP4" + +# The name the model will be served/queried under, matching what's +# returned by the /v1/models endpoint. +# +# By default this is inferred from MODEL_PATH, but when using locally downloaded +# model weights, it can be nice to have explicit control over the name. +export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" +``` + +## Aggregated WideEP + +Assuming you have at least 4 nodes allocated following the setup steps above, +follow the steps below to launch an **aggregated** deployment across 4 nodes: + +```bash +# Default set in srun_aggregated.sh, but can customize here. +# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml" + +# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG +# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of +# total GPUs necessary to satisfy the requested parallelism. For example, +# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16. +# export NUM_NODES=4 + +# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. +# export NUM_GPUS_PER_NODE=4 + +# Launches: +# - frontend + etcd/nats on current (head) node +# - one large aggregated trtllm worker across multiple nodes via MPI tasks +./srun_aggregated.sh +``` + +## Disaggregated WideEP + +Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) +following the setup above, follow the steps below to launch a **disaggregated** +deployment across 8 nodes: + +> [!Tip] +> Make sure you have a fresh environment and don't still have the aggregated +> example above deployed on the same set of nodes. + +```bash +# Defaults set in srun_disaggregated.sh, but can customize here.
+# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml" +# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml" + +# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG +# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG +# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and +# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of +# GPUs necessary to satisfy the requested parallelism in each config. +# export NUM_PREFILL_NODES=4 +# export NUM_DECODE_NODES=4 + +# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. +# export NUM_GPUS_PER_NODE=4 + +# Launches: +# - frontend + etcd/nats on current (head) node. +# - one large prefill trtllm worker across multiple nodes via MPI tasks +# - one large decode trtllm worker across multiple nodes via MPI tasks +./srun_disaggregated.sh +``` + +## Understanding the Output + +1. The `srun_aggregated.sh` script launches two `srun` jobs. The first launches + etcd, NATS, and the OpenAI frontend on the head node only, called "node1" + in the example output below. The second launches + a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each + using 4 GPUs. + ``` + # Frontend/etcd/nats services + srun: launching StepId=453374.17 on host node1, 1 tasks: 0 + ... + # TP16 TRTLLM worker split across 4 nodes with 4 gpus each + srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3] + srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7] + srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11] + srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15] + ``` +2. The OpenAI frontend will listen for and dynamically discover workers as + they register themselves with Dynamo's distributed runtime: + ``` + 0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models + 0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000" + ``` +3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each + GPU on each node, which will each output their progress while loading the model. + You can see each rank's output prefixed with the rank at the start of each log line + until the model successfully finishes loading: + ``` + 8: rank8 run mgmn worker node with mpi_world_size: 16 ... + 10: rank10 run mgmn worker node with mpi_world_size: 16 ... + 9: rank9 run mgmn worker node with mpi_world_size: 16 ... + 11: rank11 run mgmn worker node with mpi_world_size: 16 ... + ... + 15: Model init total -- 55.42s + 11: Model init total -- 55.91s + 12: Model init total -- 55.24s + ``` +4. After the model fully finishes loading on all ranks, the worker will register itself, + and the OpenAI frontend will detect it, signaled by this output: + ``` + 0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4" + ``` +5. At this point, with the worker fully initialized and detected by the frontend, + it is now ready for inference. +6. `srun_disaggregated.sh` follows a very similar flow, but launches + three `srun` jobs instead of two: one for the frontend, one for the prefill worker, + and one for the decode worker. + +## Example Request + +To verify the deployed model is working, send a `curl` request: +```bash +# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
+HOST=localhost +PORT=8000 +# "model" here should match the model name returned by the /v1/models endpoint +curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "'${SERVED_MODEL_NAME}'", + "messages": [ + { + "role": "user", + "content": "Tell me a story as if we were playing dungeons and dragons." + } + ], + "stream": true, + "max_tokens": 30 +}' +``` + +## Cleanup + +To cleanup background `srun` processes launched by `srun_aggregated.sh` or +`srun_disaggregated.sh`, you can run: +```bash +pkill srun +``` + +## Known Issues + +- This example has only been tested on a 4xGB200 node setup with 16 GPUs using + FP4 weights. In theory, the example should work on alternative setups such as + H100 nodes with FP8 weights, but this hasn't been tested yet. +- This example only tests an aggregated model setup for now. A disaggregated + serving example will be added in the near future. +- WideEP configs in this directory are still being tested. A WideEP specific + example with documentation will be added once ready. +- There are known issues where WideEP workers may not cleanly shut down: + - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For + now, you must manually clean these up before deploying again on the + same set of nodes. + - Similarly, there may be GPU memory left in-use after killing the `srun` + jobs. After cleaning up any leftover shared memory files as described + above, the GPU memory may slowly come back. You can run `watch nvidia-smi` + to check on this behavior. If you don't free the GPU memory before the + next deployment, you may get a CUDA OOM error while loading the model. + - There is mention of this issue in the relevant TRT-LLM blog + [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous). diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml new file mode 100644 index 0000000000..d697caacfa --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml @@ -0,0 +1,27 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Example of a Multi-node worker, but no WideEP or EPLB. +# See wide_ep*.yaml for WideEP example configs. +backend: pytorch +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 +enable_attention_dp: true +max_batch_size: 256 +max_num_tokens: 256 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.7 +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml new file mode 100644 index 0000000000..f2fe0a13c6 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml @@ -0,0 +1,7 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# moe_load_balancer settings for TRTLLM based on: +# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer +num_slots: 288 +layer_updates_per_iter: 2 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml new file mode 100644 index 0000000000..5bbc66bd69 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml @@ -0,0 +1,35 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +backend: pytorch + +# WideEP related settings +moe_backend: WideEP +# moe_max_num_tokens will default to max_num_tokens if left unspecified. +# +# If you want to set this value explicitly, one recommendation is below: +# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size +# 4096 = 256 * 16 +# moe_max_num_tokens: 4096 +moe_load_balancer: /mnt/engine_configs/eplb.yaml +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 + +enable_attention_dp: true +max_batch_size: 256 +max_num_tokens: 256 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.7 +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml new file mode 100644 index 0000000000..ac7fc7e8f6 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml @@ -0,0 +1,59 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# WideEP related settings +moe_backend: WideEP +moe_load_balancer: /mnt/engine_configs/eplb.yaml + +# TP/EP/PP/DP +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 +pipeline_parallel_size: 1 +enable_attention_dp: true + +max_batch_size: 256 +max_num_tokens: 256 +# 8448 = 8192 ISL + 256 OSL +max_seq_len: 8448 + +kv_cache_config: + # With dp attention disabled: high free_gpu_memory_fraction is fine. + # free_gpu_memory_fraction: 0.85 + # With dp attention enabled: large ISL at high concurrency may need + # free_gpu_memory_fraction low to have enough available memory. 
+ free_gpu_memory_fraction: 0.30 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: false +use_cuda_graph: true +cuda_graph_padding_enabled: true +# NOTE: For larger max batch size, you may want to add larger cuda graph +# batch sizes below to match. +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml new file mode 100644 index 0000000000..06968a3a76 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml @@ -0,0 +1,41 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +backend: pytorch + +# WideEP related settings +moe_backend: WideEP +moe_load_balancer: /mnt/engine_configs/eplb.yaml + +# TP/EP/PP/DP +tensor_parallel_size: 16 +moe_expert_parallel_size: 16 +pipeline_parallel_size: 1 +enable_attention_dp: true + +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 + +kv_cache_config: + free_gpu_memory_fraction: 0.75 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +disable_overlap_scheduler: true +print_iter_log: true +# NOTE: This dtype must match in both prefill/decode configs +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh new file mode 100755 index 0000000000..5a632551b9 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh @@ -0,0 +1,75 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# This is one of the only variables that must be set currently, most of the rest may +# just work out of the box if following the steps in the README. +IMAGE="${IMAGE:-""}" + +# Set to mount current host directory to /mnt inside the container as an example, +# but you may freely customize the mounts based on your cluster. A common practice +# is to mount paths to NFS storage for common scripts, model weights, etc. 
+# NOTE: This can be a comma separated list of multiple mounts as well. +DEFAULT_MOUNT="${PWD}:/mnt" +MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" + +# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes. +# For 8xH100 nodes as an example, you may set this to 2 nodes x 8 gpus/node instead. +NUM_NODES=${NUM_NODES:-4} +NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} + +export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_agg.yaml}" + +# Automate settings of certain variables for convenience, but you are free +# to manually set these for more control as well. +ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" +export HEAD_NODE="${SLURMD_NODENAME}" +export HEAD_NODE_IP="$(hostname -i)" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" + +if [[ -z ${IMAGE} ]]; then + echo "ERROR: You need to set the IMAGE environment variable to the " \ + "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ + "See how to build one from source here: " \ + "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" + exit 1 +fi + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching frontend services in background." +srun \ + --overlap \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodelist "${HEAD_NODE}" \ + --nodes 1 \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_frontend_services.sh & + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching multi-node worker in background." +# No --task for the worker defaults to aggregated mode +TASK="" \ +srun \ + --mpi pmix \ + --oversubscribe \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodes "${NUM_NODES}" \ + --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh new file mode 100755 index 0000000000..32cb4993a9 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh @@ -0,0 +1,94 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# This is one of the only variables that must be set currently, most of the rest may +# just work out of the box if following the steps in the README. +IMAGE="${IMAGE:-""}" + +# Set to mount current host directory to /mnt inside the container as an example, +# but you may freely customize the mounts based on your cluster. A common practice +# is to mount paths to NFS storage for common scripts, model weights, etc. +# NOTE: This can be a comma separated list of multiple mounts as well. 
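+# For example, to also mount a shared filesystem (hypothetical host path):
+# MOUNTS="${PWD}:/mnt,/lustre:/lustre"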
+DEFAULT_MOUNT="${PWD}:/mnt" +MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" + +NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} + +NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4} +PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_prefill.yaml}" + +NUM_DECODE_NODES=${NUM_DECODE_NODES:-4} +DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_decode.yaml}" + +# Automate settings of certain variables for convenience, but you are free +# to manually set these for more control as well. +ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" +export HEAD_NODE="${SLURMD_NODENAME}" +export HEAD_NODE_IP="$(hostname -i)" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" + +if [[ -z ${IMAGE} ]]; then + echo "ERROR: You need to set the IMAGE environment variable to the " \ + "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ + "See how to build one from source here: " \ + "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" + exit 1 +fi + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching frontend services in background." +srun \ + --overlap \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodelist "${HEAD_NODE}" \ + --nodes 1 \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_frontend_services.sh & + +# NOTE: Output streamed to stdout for ease of understanding the example, but +# in practice you would probably set `srun --output ... --error ...` to pipe +# the stdout/stderr to files. +echo "Launching multi-node prefill worker in background." +TASK=prefill \ +ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \ +srun \ + --mpi pmix \ + --oversubscribe \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodes "${NUM_PREFILL_NODES}" \ + --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_trtllm_worker.sh & + +echo "Launching multi-node decode worker in background." +TASK=decode \ +ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \ +srun \ + --mpi pmix \ + --oversubscribe \ + --container-image "${IMAGE}" \ + --container-mounts "${MOUNTS}" \ + --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ + --verbose \ + --label \ + -A "${ACCOUNT}" \ + -J "${ACCOUNT}-dynamo.trtllm" \ + --nodes "${NUM_DECODE_NODES}" \ + --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ + --jobid "${SLURM_JOB_ID}" \ + /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh new file mode 100755 index 0000000000..0d1b588904 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh @@ -0,0 +1,16 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# Start NATS +nats-server -js & + +# Start etcd +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & + +# Wait for NATS/etcd to startup +sleep 3 + +# Start OpenAI Frontend which will dynamically discover workers when they startup +# NOTE: This is a blocking call. +dynamo-run in=http out=dyn --http-port 8000 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh new file mode 100755 index 0000000000..257b3b1127 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +if [[ -z ${MODEL_PATH} ]]; then + echo "ERROR: MODEL_PATH was not set." + echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \ + "downloaded path to the model weights. Since Deepseek R1 is large, it is " \ + "recommended to pre-download them to a shared location and provide the path." + exit 1 +fi + +if [[ -z ${SERVED_MODEL_NAME} ]]; then + echo "WARNING: SERVED_MODEL_NAME was not set. It will be derived from MODEL_PATH." +fi + + + +if [[ -z ${ENGINE_CONFIG} ]]; then + echo "ERROR: ENGINE_CONFIG was not set." + echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file." + exit 1 +fi + +EXTRA_ARGS="" +if [[ -n ${TASK} ]]; then + EXTRA_ARGS+="--task ${TASK}" +fi + +# NOTE: When this script is run directly from srun, the environment variables +# for TRTLLM KV cache are not set. So we need to set them here. +# Related issue: https://github.com/ai-dynamo/dynamo/issues/1743 +if [[ -z ${TRTLLM_USE_UCX_KVCACHE} ]] && [[ -z ${TRTLLM_USE_NIXL_KVCACHE} ]]; then + export TRTLLM_USE_UCX_KVCACHE=1 +fi + +# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM +# worker and registers itself with the runtime. It is currently easier to wrap +# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes, +# but this may be refactored into 'dynamo serve' in the future. +trtllm-llmapi-launch \ + python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \ + --model-path "${MODEL_PATH}" \ + --model-name "${SERVED_MODEL_NAME}" \ + --extra-engine-args "${ENGINE_CONFIG}" \ + ${EXTRA_ARGS} diff --git a/examples/tensorrt_llm_sd/configs/disagg.yaml b/examples/tensorrt_llm_sd/configs/disagg.yaml new file mode 100644 index 0000000000..454e1640e6 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/disagg.yaml @@ -0,0 +1,48 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
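+
+# Disaggregated serving example: TensorRTLLMWorker handles decode (enable-disagg: true)
+# and TensorRTLLMPrefillWorker handles prefill, each pointing at its own engine config
+# under configs/engine_configs/.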
+ +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/decode_config.yaml" + enable-disagg: true + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 + +TensorRTLLMPrefillWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 + diff --git a/examples/tensorrt_llm_sd/configs/disagg_router.yaml b/examples/tensorrt_llm_sd/configs/disagg_router.yaml new file mode 100644 index 0000000000..faae7f65a3 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/disagg_router.yaml @@ -0,0 +1,47 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: kv + +TensorRTLLMWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Name to serve the model under + served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/engine_configs/decode_config.yaml" + enable-disagg: true + router: kv + ServiceArgs: + workers: 1 + resources: + gpu: 1 + +TensorRTLLMPrefillWorker: + # Path to disk model or HuggingFace model identifier to load + model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. 
+ extra-engine-args: "configs/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..02b5cd8463 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml @@ -0,0 +1,31 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +enable_attention_dp: false +max_num_tokens: 8192 +max_batch_size: 16 +trust_remote_code: true +backend: pytorch +enable_chunked_prefill: true + +kv_cache_config: + free_gpu_memory_fraction: 0.95 + +# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 +# NOTE: overlap_scheduler enabled by default since this commit and changed +# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': +# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 +use_cuda_graph: true diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..eb943fd6e7 --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml @@ -0,0 +1,27 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +enable_attention_dp: false +max_num_tokens: 8192 +max_batch_size: 16 +trust_remote_code: true +backend: pytorch +enable_chunked_prefill: true +disable_overlap_scheduler: false +use_cuda_graph: true +kv_cache_config: + free_gpu_memory_fraction: 0.95 + diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..5dee9e653d --- /dev/null +++ b/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml @@ -0,0 +1,28 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +tensor_parallel_size: 1 +moe_expert_parallel_size: 1 +enable_attention_dp: false +max_num_tokens: 8192 +max_batch_size: 16 +trust_remote_code: true +backend: pytorch +enable_chunked_prefill: true +# Overlap scheduler not currently supported in prefill only workers. +disable_overlap_scheduler: true +use_cuda_graph: false + +kv_cache_config: + free_gpu_memory_fraction: 0.95 diff --git a/examples/tensorrt_llm_sd/graphs/agg.py b/examples/tensorrt_llm_sd/graphs/agg.py new file mode 100644 index 0000000000..e79f5f315c --- /dev/null +++ b/examples/tensorrt_llm_sd/graphs/agg.py @@ -0,0 +1,19 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from components.frontend import Frontend +from components.worker import TensorRTLLMWorker + +Frontend.link(TensorRTLLMWorker) diff --git a/examples/tensorrt_llm_sd/graphs/disagg.py b/examples/tensorrt_llm_sd/graphs/disagg.py new file mode 100644 index 0000000000..58bde05d9a --- /dev/null +++ b/examples/tensorrt_llm_sd/graphs/disagg.py @@ -0,0 +1,20 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
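+
+# Disaggregated serving graph: Frontend -> TensorRTLLMWorker (decode) -> TensorRTLLMPrefillWorker (prefill).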
+ +from components.frontend import Frontend +from components.prefill_worker import TensorRTLLMPrefillWorker +from components.worker import TensorRTLLMWorker + +Frontend.link(TensorRTLLMWorker).link(TensorRTLLMPrefillWorker) From bc39d0e6d7bb06070a14243cac8dc3a46e547e31 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 10:12:35 -0700 Subject: [PATCH 03/20] llama4+eagle3 configuration --- .../configs/{deepseek_r1 => llama4_eagle3}/agg.yaml | 12 ++++-------- .../{deepseek_r1 => llama4_eagle3}/disagg.yaml | 0 .../engine_configs/agg_config.yaml | 0 .../engine_configs/decode_config.yaml | 0 .../engine_configs/prefill_config.yaml | 0 .../mtp/engine_configs/agg_config.yaml | 0 .../mtp/engine_configs/decode_config.yaml | 0 .../mtp/engine_configs/prefill_config.yaml | 0 .../{deepseek_r1 => llama4_eagle3}/mtp/mtp_agg.yaml | 12 ++++-------- .../mtp/mtp_disagg.yaml | 0 .../multinode/README.md | 0 .../multinode/engine_configs/dep16_agg.yaml | 0 .../multinode/engine_configs/eplb.yaml | 0 .../multinode/engine_configs/wide_ep_agg.yaml | 0 .../multinode/engine_configs/wide_ep_decode.yaml | 0 .../multinode/engine_configs/wide_ep_prefill.yaml | 0 .../multinode/srun_aggregated.sh | 0 .../multinode/srun_disaggregated.sh | 0 .../multinode/start_frontend_services.sh | 0 .../multinode/start_trtllm_worker.sh | 0 20 files changed, 8 insertions(+), 16 deletions(-) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/agg.yaml (69%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/disagg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/engine_configs/agg_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/engine_configs/decode_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/engine_configs/prefill_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/engine_configs/agg_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/engine_configs/decode_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/engine_configs/prefill_config.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/mtp_agg.yaml (71%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/mtp/mtp_disagg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/README.md (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/dep16_agg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/eplb.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/wide_ep_agg.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/wide_ep_decode.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/engine_configs/wide_ep_prefill.yaml (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/srun_aggregated.sh (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/srun_disaggregated.sh (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => llama4_eagle3}/multinode/start_frontend_services.sh (100%) rename examples/tensorrt_llm_sd/configs/{deepseek_r1 => 
llama4_eagle3}/multinode/start_trtllm_worker.sh (100%) diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml similarity index 69% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml index f7cec35e7d..3ac3facedd 100644 --- a/examples/tensorrt_llm_sd/configs/deepseek_r1/agg.yaml +++ b/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml @@ -15,19 +15,15 @@ Frontend: # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/DeepSeek-R1-FP4" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - extra-engine-args: "configs/deepseek_r1/engine_configs/agg_config.yaml" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + extra-engine-args: "configs/llama4_eagle3/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: workers: 1 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/disagg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/agg_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/decode_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/engine_configs/prefill_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/agg_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/decode_config.yaml rename to 
examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml similarity index 71% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml index c51abf9d95..626ca27953 100644 --- a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_agg.yaml +++ b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml @@ -14,21 +14,17 @@ # limitations under the License. Frontend: - served_model_name: "nvidia/DeepSeek-R1-FP4" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. 
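With the `llama4_eagle3` config above, the Frontend listens on port 8000 and requests must name the configured `served_model_name`. The following is a hypothetical smoke test; it assumes the frontend exposes an OpenAI-compatible `/v1/chat/completions` route, so adjust the path and payload to match your deployment.

```python
# Hypothetical smoke test against the Frontend on port 8000.
import requests

payload = {
    "model": "nvidia/Llama-4-Maverick-17B-128E-Eagle3",  # must match served_model_name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32,
    "stream": False,
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, timeout=120
)
resp.raise_for_status()
print(resp.json())
```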
- extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/agg_config.yaml" + extra-engine-args: "configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: workers: 1 diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/mtp/mtp_disagg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/README.md rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/dep16_agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/eplb.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_agg.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_decode.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/engine_configs/wide_ep_prefill.yaml rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_aggregated.sh rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/srun_disaggregated.sh rename to 
examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_frontend_services.sh rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh diff --git a/examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh similarity index 100% rename from examples/tensorrt_llm_sd/configs/deepseek_r1/multinode/start_trtllm_worker.sh rename to examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh From 4a0b8c64b00a6d5882324bdbeda3cfc52fc4ab14 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 11:59:20 -0700 Subject: [PATCH 04/20] Test --- examples/tensorrt_llm_sd/configs/disagg.yaml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/tensorrt_llm_sd/configs/disagg.yaml b/examples/tensorrt_llm_sd/configs/disagg.yaml index 454e1640e6..73c202f4be 100644 --- a/examples/tensorrt_llm_sd/configs/disagg.yaml +++ b/examples/tensorrt_llm_sd/configs/disagg.yaml @@ -14,16 +14,16 @@ # limitations under the License. Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/engine_configs/decode_config.yaml" @@ -36,7 +36,7 @@ TensorRTLLMWorker: TensorRTLLMPrefillWorker: # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. 
extra-engine-args: "configs/engine_configs/prefill_config.yaml" From e889649d14cfe0fd447fcc3e9ab13a24d2286461 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 19:18:38 -0700 Subject: [PATCH 05/20] Modified Workflow --- .../eagle/engine_configs/agg_config.yaml | 50 +++++++++++++++++ .../eagle/engine_configs/decode_config.yaml | 53 +++++++++++++++++++ .../eagle/engine_configs/prefill_config.yaml | 37 +++++++++++++ .../configs/llama4/eagle/mtp_agg.yaml | 31 +++++++++++ .../configs/llama4/eagle/mtp_disagg.yaml | 52 ++++++++++++++++++ 5 files changed, 223 insertions(+) create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml create mode 100644 examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml new file mode 100644 index 0000000000..633d630633 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -0,0 +1,50 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +# enable_attention_dp: true +max_batch_size: 256 +# 8448 = 8192 ISL + 256 OSL +max_num_tokens: 8448 +max_seq_len: 8448 +kv_cache_config: + free_gpu_memory_fraction: 0.30 + +# Enable the MTP(Multi-Token Prediction) in the model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml new file mode 100644 index 0000000000..fed64bcb22 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -0,0 +1,53 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +# enable_attention_dp: false +max_batch_size: 256 +# Note: When MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: +# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) +# This is a known issue in TensorRT-LLM and will be resolved in the next release. +max_num_tokens: 512 +# 8704 = 8192 ISL + 512 OSL +max_seq_len: 8704 +kv_cache_config: + free_gpu_memory_fraction: 0.85 + +# Enable the MTP(Multi-Token Prediction) in decode model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +use_cuda_graph: true +cuda_graph_padding_enabled: true +cuda_graph_batch_sizes: +- 1 +- 2 +- 4 +- 8 +- 16 +- 32 +- 64 +- 128 +- 256 +print_iter_log: true +kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml new file mode 100644 index 0000000000..6dd4bca5ed --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -0,0 +1,37 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NOTE: FP4 only supported starting with Blackwell GPUs. +# https://huggingface.co/nvidia/DeepSeek-R1-FP4 +# You can also specify the full path to locally downloaded weights +# instead of a HuggingFace ID here. + +backend: pytorch +tensor_parallel_size: 4 +moe_expert_parallel_size: 4 +# enable_attention_dp: true +max_batch_size: 1 +max_num_tokens: 8192 +max_seq_len: 8192 +kv_cache_config: + free_gpu_memory_fraction: 0.75 +print_iter_log: true +kv_cache_dtype: fp8 +disable_overlap_scheduler: true + +# Enable the MTP(Multi-Token Prediction) in the prefill model engine +speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml new file mode 100644 index 0000000000..6a64336101 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml @@ -0,0 +1,31 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
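The decode engine config above notes that with MTP enabled, `max_num_tokens` must be at least `max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)`. A quick sanity check for the values used in that file (256 * 2 = 512):

```python
# Check of the constraint called out in decode_config.yaml above.
cuda_graph_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
num_nextn_predict_layers = 1
max_num_tokens = 512

required = max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)  # 256 * 2 = 512
assert max_num_tokens >= required, (
    f"max_num_tokens={max_num_tokens} must be >= {required} when MTP is enabled"
)
print(f"OK: max_num_tokens={max_num_tokens} satisfies the minimum of {required}")
```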
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + # This is the client-facing model name, you can set this to anything you'd like. + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml new file mode 100644 index 0000000000..72d3ce6f29 --- /dev/null +++ b/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml @@ -0,0 +1,52 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +Frontend: + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + endpoint: dynamo.TensorRTLLMWorker.generate + port: 8000 + router: round-robin + +TensorRTLLMWorker: + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. + extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml" + router: round-robin + enable-disagg: true + ServiceArgs: + workers: 1 + resources: + gpu: 4 + +TensorRTLLMPrefillWorker: + # NOTE: FP4 only supported starting with Blackwell GPUs. + # https://huggingface.co/nvidia/DeepSeek-R1-FP4 + # You can also specify the full path to locally downloaded weights + # instead of a HuggingFace ID here. + model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. + # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields.
+ extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml" + router: round-robin + ServiceArgs: + workers: 1 + resources: + gpu: 4 From 7b4c32a00fe94b4b8637fa916d12f9027ed97836 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Tue, 8 Jul 2025 19:26:50 -0700 Subject: [PATCH 06/20] Streamlined Example --- examples/tensorrt_llm_sd/README.md | 352 ---------------- examples/tensorrt_llm_sd/__init__.py | 14 - examples/tensorrt_llm_sd/common/__init__.py | 0 .../tensorrt_llm_sd/common/base_engine.py | 389 ------------------ examples/tensorrt_llm_sd/common/parser.py | 62 --- examples/tensorrt_llm_sd/common/protocol.py | 104 ----- .../tensorrt_llm_sd/components/frontend.py | 119 ------ .../components/prefill_worker.py | 75 ---- examples/tensorrt_llm_sd/components/worker.py | 115 ------ examples/tensorrt_llm_sd/configs/agg.yaml | 34 -- .../tensorrt_llm_sd/configs/agg_router.yaml | 34 -- examples/tensorrt_llm_sd/configs/disagg.yaml | 48 --- .../configs/disagg_router.yaml | 47 --- .../configs/engine_configs/agg_config.yaml | 31 -- .../configs/engine_configs/decode_config.yaml | 27 -- .../engine_configs/prefill_config.yaml | 28 -- .../configs/llama4_eagle3/agg.yaml | 31 -- .../configs/llama4_eagle3/disagg.yaml | 49 --- .../engine_configs/agg_config.yaml | 54 --- .../engine_configs/decode_config.yaml | 55 --- .../engine_configs/prefill_config.yaml | 37 -- .../mtp/engine_configs/agg_config.yaml | 50 --- .../mtp/engine_configs/decode_config.yaml | 53 --- .../mtp/engine_configs/prefill_config.yaml | 37 -- .../configs/llama4_eagle3/mtp/mtp_agg.yaml | 32 -- .../configs/llama4_eagle3/mtp/mtp_disagg.yaml | 52 --- .../configs/llama4_eagle3/multinode/README.md | 275 ------------- .../multinode/engine_configs/dep16_agg.yaml | 27 -- .../multinode/engine_configs/eplb.yaml | 7 - .../multinode/engine_configs/wide_ep_agg.yaml | 35 -- .../engine_configs/wide_ep_decode.yaml | 59 --- .../engine_configs/wide_ep_prefill.yaml | 41 -- .../multinode/srun_aggregated.sh | 75 ---- .../multinode/srun_disaggregated.sh | 94 ----- .../multinode/start_frontend_services.sh | 16 - .../multinode/start_trtllm_worker.sh | 46 --- examples/tensorrt_llm_sd/graphs/agg.py | 19 - examples/tensorrt_llm_sd/graphs/disagg.py | 20 - 38 files changed, 2643 deletions(-) delete mode 100644 examples/tensorrt_llm_sd/README.md delete mode 100644 examples/tensorrt_llm_sd/__init__.py delete mode 100644 examples/tensorrt_llm_sd/common/__init__.py delete mode 100644 examples/tensorrt_llm_sd/common/base_engine.py delete mode 100644 examples/tensorrt_llm_sd/common/parser.py delete mode 100644 examples/tensorrt_llm_sd/common/protocol.py delete mode 100644 examples/tensorrt_llm_sd/components/frontend.py delete mode 100644 examples/tensorrt_llm_sd/components/prefill_worker.py delete mode 100644 examples/tensorrt_llm_sd/components/worker.py delete mode 100644 examples/tensorrt_llm_sd/configs/agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/agg_router.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/disagg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/disagg_router.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml delete mode 100644 
examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml delete mode 100644 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh delete mode 100755 examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh delete mode 100644 examples/tensorrt_llm_sd/graphs/agg.py delete mode 100644 examples/tensorrt_llm_sd/graphs/disagg.py diff --git a/examples/tensorrt_llm_sd/README.md b/examples/tensorrt_llm_sd/README.md deleted file mode 100644 index f844a56d94..0000000000 --- a/examples/tensorrt_llm_sd/README.md +++ /dev/null @@ -1,352 +0,0 @@ - - -# LLM Deployment Examples using TensorRT-LLM - -This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM. - -## Use the Latest Release - -We recommend using the latest stable release of dynamo to avoid breaking changes: - -[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) - -You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: - -```bash -git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) -``` - -## Deployment Architectures - -See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. -Note that this TensorRT-LLM version does not support all the options yet. - -Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving. - -## Getting Started - -1. Choose a deployment architecture based on your requirements -2. Configure the components as needed -3. 
Deploy using the provided scripts - -### Prerequisites - -Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml) -```bash -docker compose -f deploy/metrics/docker-compose.yml up -d -``` - -### Build docker - -```bash -# TensorRT-LLM uses git-lfs, which needs to be installed in advance. -apt-get update && apt-get -y install git git-lfs - -# On an x86 machine: -./container/build.sh --framework tensorrtllm - -# On an ARM machine: -./container/build.sh --framework tensorrtllm --platform linux/arm64 - -# Build the container with the default experimental TensorRT-LLM commit -# WARNING: This is for experimental feature testing only. -# The container should not be used in a production environment. -./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit -``` - -### Run container - -``` -./container/run.sh --framework tensorrtllm -it -``` -## Run Deployment - -This figure shows an overview of the major components to deploy: - - - -``` - -+------+ +-----------+ +------------------+ +---------------+ -| HTTP |----->| processor |----->| Worker |------------>| Prefill | -| |<-----| |<-----| |<------------| Worker | -+------+ +-----------+ +------------------+ +---------------+ - | ^ | - query best | | return | publish kv events - worker | | worker_id v - | | +------------------+ - | +---------| kv-router | - +------------->| | - +------------------+ - -``` - -Note: The above architecture illustrates all the components. The final components -that get spawned depend upon the chosen graph. - -### Example architectures - -#### Aggregated serving -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml -``` - -#### Aggregated serving with KV Routing -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f ./configs/agg_router.yaml -``` - -#### Disaggregated serving -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml -``` - -#### Disaggregated serving with KV Routing -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.disagg:Frontend -f ./configs/disagg_router.yaml -``` - -#### Aggregated serving with Multi-Token Prediction (MTP) and DeepSeek R1 -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml -``` - -Notes: -- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script. - - Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit` - -- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. -- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. - -#### Multi-Node Disaggregated Serving - -In the following example, we will demonstrate how to run a Disaggregated Serving -deployment across multiple nodes. For simplicity, we will demonstrate how to -deploy a single Decode worker on one node, and a single Prefill worker on the other node. 
-However, the instance counts, TP sizes, other configs, and responsibilities of each node -can be customized and deployed in similar ways. - -For example, to deploy Deepseek R1, you could replace the referenced example -configs (`configs/agg.yaml`, `configs/disagg.yaml`) with corresponding Deepseek R1 -example configs (`configs/deepseek_r1/agg.yaml`, `configs/deepseek_r1/disagg.yaml`). -You can find the example Deepseek R1 configs for GB200 -[here](configs/deepseek_r1), but the config settings can be customized for testing -other hardware configurations or parallelism strategies. - -This "multi-node" example demonstrates how to generally connect dynamo workers from -different nodes, but for simplicity, each worker individually fits on a single node. -For details on how to launch a worker that spans multiple nodes due to sheer model -size, or for features like large scale expert parallelism, see the -[multinode worker example](configs/deepseek_r1/multinode). - -##### Head Node - -Start nats/etcd: -```bash -# NATS data persisted to /tmp/nats/jetstream by default -nats-server -js & - -# Persist data to /tmp/etcd, otherwise defaults to ${PWD}/default.etcd if left unspecified -etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & - -# NOTE: Clearing out the etcd and nats jetstream data directories across runs -# helps to guarantee a clean and reproducible results. -``` - -Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: - -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f ./configs/disagg.yaml & -``` - -Notes: -- The aggregated graph (`graphs.agg`) is chosen here because it also describes - our desired deployment settings for the head node: launching the utility components - (Frontend, Processor), and only the decode worker (TensorRTLLMWorker configured with - `remote-prefill` enabled). We plan to launch the `TensorRTLLMPrefillWorker` - independently on a separate node in the next step of this demonstration. - You are free to customize the graph and configuration of components launched on - each node. -- The disaggregated config `configs/disagg.yaml` is intentionally chosen here as a - single source of truth to be used for deployments on all of our nodes, describing - the configurations for all of our components, including both decode and prefill - workers, but can be customized based on your deployment needs. - -##### Worker Node(s) - -Set environment variables pointing at the etcd/nats endpoints on the head node -so the Dynamo Distributed Runtime can orchestrate communication and -discoverability between the head node and worker nodes: -```bash -# if not head node -export HEAD_NODE_IP="" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -``` - -Deploy a Prefill worker: -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker & -``` - -Now you have a 2-node deployment with 1 Decode worker on the head node, and 1 Prefill worker on a worker node! 
- -##### Additional Notes for Multi-Node Deployments - -Notes: -- To include a router in this deployment, change the graph to one that includes the router, such as `graphs.agg_router`, - and change the config to one that includes the router, such as `configs/disagg_router.yaml` -- This step is assuming you're disaggregated serving and planning to launch prefill workers on separate nodes. - Howerver, for an aggregated deployment with additional aggregated worker replicas on other nodes, this step - remains mostly the same. The primary difference between aggregation and disaggregation for this step is - whether or not the `TensorRTLLMWorker` is configured to do `remote-prefill` or not in the config file - (ex: `configs/disagg.yaml` vs `configs/agg.yaml`). -- To apply the same concept for launching additional decode workers on worker nodes, you can - directly start them, similar to the prefill worker step above: - ```bash - # Example: deploy decode worker only - cd /workspace/examples/tensorrt_llm - dynamo serve components.worker:TensorRTLLMWorker -f ./configs/disagg.yaml --service-name TensorRTLLMWorker & - ``` -- If you see an error about MPI Spawn failing during TRTLLM Worker initialziation on a Slurm-based cluster, - try unsetting the following environment variables before launching the TRTLLM worker. If you intend to - run other slurm-based commands or processes on the same node after deploying the TRTLLM worker, you may - want to save these values into temporary variables and then restore them afterwards. - ```bash - # Workaround for error: `mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes` - unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST - ``` - -#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1 - -Most of the steps remain the same as the above example, but this time we will have `dynamo serve` point to different config files that contains the MTP configurations - -##### Head Node - -Start nats/etcd -```bash -nats-server -js & -etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & -``` - -Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: - -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml & -``` - -##### Worker Node(s) - -Set environment variables pointing at the etcd/nats endpoints on the head node. -```bash -export HEAD_NODE_IP="" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -``` - -Deploy a Prefill worker: -```bash -cd /workspace/examples/tensorrt_llm -dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker & -``` - -Notes: -- MTP is only available within the container built with the experimental TensorRT-LLM commit. Please add --use-default-experimental-tensorrtllm-commit to the arguments of the build.sh script. - - Example: `./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit` -- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. -- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. 
Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. - - -### Client - -See [client](../llm/README.md#client) section to learn how to send request to the deployment. - -NOTE: To send a request to a multi-node deployment, target the node which deployed the `Frontend` component. - -### Close deployment - -See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment. - -### Benchmarking - -To benchmark your deployment with GenAI-Perf, see this utility script, configuring the -`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh) - - -### KV Cache Transfer for Disaggregated Serving - -In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer: - -#### Default Method: UCX -By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers. - -#### Experimental Method: NIXL -TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments. - -**Note:** NIXL support in TensorRT-LLM is experimental and is not suitable for production environments yet. - -#### Using NIXL for KV Cache Transfer - -**Note:** NIXL backend for TensorRT-LLM is currently only supported on AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer. - -To enable NIXL for KV cache transfer in disaggregated serving: - -1. **Build the container with NIXL support:** - The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support. - - **Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):** - ```bash - rm -rf /tmp/trtllm_wheel - ``` - - **Build the container with NIXL support:** - ```bash - ./container/build.sh --framework tensorrtllm \ - --use-default-experimental-tensorrtllm-commit \ - --trtllm-use-nixl-kvcache-experimental - ``` - - **Note:** Both `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support. - -2. **Run the containerized environment:** - See [run container](#run-container) section to learn how to start the container image built in previous step. - -3. **Start the disaggregated service:** - See [disaggregated serving](#disaggregated-serving) to see how to start the deployment. - -4. **Send the request:** - See [client](#client) section to learn how to send the request to deployment. - -**Important:** Ensure that ETCD and NATS services are running before starting the service. - -The container will automatically configure the appropriate environment variables (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can be used to use UCX for KV cache transfer. 
-```bash -unset TRTLLM_USE_NIXL_KVCACHE -export TRTLLM_USE_UCX_KVCACHE=1 -``` - diff --git a/examples/tensorrt_llm_sd/__init__.py b/examples/tensorrt_llm_sd/__init__.py deleted file mode 100644 index 3159bfe656..0000000000 --- a/examples/tensorrt_llm_sd/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/tensorrt_llm_sd/common/__init__.py b/examples/tensorrt_llm_sd/common/__init__.py deleted file mode 100644 index e69de29bb2..0000000000 diff --git a/examples/tensorrt_llm_sd/common/base_engine.py b/examples/tensorrt_llm_sd/common/base_engine.py deleted file mode 100644 index 3df95b490c..0000000000 --- a/examples/tensorrt_llm_sd/common/base_engine.py +++ /dev/null @@ -1,389 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import logging -from dataclasses import dataclass -from typing import Any, Optional - -from common.protocol import DisaggregatedTypeConverter, TRTLLMWorkerRequest -from tensorrt_llm import SamplingParams -from tensorrt_llm.llmapi.llm_utils import update_llm_args_with_extra_options -from tensorrt_llm.llmapi.tokenizer import tokenizer_factory -from tensorrt_llm.serve.openai_protocol import ( - DisaggregatedParams as OAIDisaggregatedParams, -) - -from dynamo.llm import get_tensorrtllm_engine, get_tensorrtllm_publisher -from dynamo.runtime import DistributedRuntime - -logger = logging.getLogger(__name__) - -logger.setLevel(logging.DEBUG) - -# Default buffer size for kv cache events. -DEFAULT_KV_EVENT_BUFFER_MAX_SIZE = 1024 - - -def parse_endpoint(endpoint: str) -> tuple[str, str, str]: - endpoint_str = endpoint.replace("dyn://", "", 1) - endpoint_parts = endpoint_str.split(".") - if len(endpoint_parts) != 3: - raise ValueError( - f"Invalid endpoint format: '{endpoint}'. " - "Expected 'dyn://namespace.component.endpoint' or 'namespace.component.endpoint'." 
- ) - - return (endpoint_parts[0], endpoint_parts[1], endpoint_parts[2]) - - -@dataclass -class BaseEngineConfig: - """Base engine configuration""" - - namespace: str - component: str - endpoint: str - model_path: str - served_model_name: Optional[str] = None - kv_block_size: int = 32 - extra_engine_args: str = "" - publish_events_and_metrics: bool = False - disaggregation_mode: str = "prefill_and_decode" - remote_prefill_endpoint: Optional[str] = None - lease_id: int = 0 - - def __str__(self) -> str: - return ( - f"Config(namespace={self.namespace}, " - f"component={self.component}, " - f"endpoint={self.endpoint}, " - f"model_path={self.model_path}, " - f"served_model_name={self.served_model_name}, " - f"kv_block_size={self.kv_block_size}, " - f"extra_engine_args={self.extra_engine_args}, " - f"publish_events_and_metrics={self.publish_events_and_metrics}, " - f"disaggregation_mode={self.disaggregation_mode}, " - f"remote_prefill_endpoint={self.remote_prefill_endpoint}, " - f"lease_id={self.lease_id})" - ) - - -class BaseTensorrtLLMEngine: - def __init__( - self, - config: BaseEngineConfig, - ): - self._config = config - self._prefill_client = None - self._llm_engine = None - self._llm_engine_context = None - self._llm_publisher = None - self._llm_publisher_context = None - self._runtime = None - self._first_generation = True - # Initialize default sampling params - self.default_sampling_params = SamplingParams() - - async def initialize(self, runtime: DistributedRuntime): - """Initialize the engine and prefill client if needed""" - self._runtime = runtime - - # Convert model path to Path object if it's a local path, otherwise keep as string - model_path = str(self._config.model_path) - - # Initialize the LLM engine - engine_args: dict[str, Any] = { - "model": model_path, - "tensor_parallel_size": 1, - "backend": "pytorch", - "skip_tokenizer_init": True, - } - - if self._config.extra_engine_args: - # TODO: Support extra engine args from json file as well. - engine_args = update_llm_args_with_extra_options( - engine_args, self._config.extra_engine_args - ) - # Update the model path in the config to the model path used by the engine. - self._config.model_path = str(engine_args["model"]) - if not self._config.model_path: - raise ValueError( - "Model specification is required. Present neither in the config nor in the extra engine args." - ) - - # Populate default sampling params from the model - tokenizer = tokenizer_factory(self._config.model_path) - self.default_sampling_params = SamplingParams() - self.default_sampling_params._setup(tokenizer) - self.default_sampling_params.stop = None - - if self._config.publish_events_and_metrics: - # 'event_buffer_max_size' is required to enable TRTLLM to publish kv cache events. - kv_cache_config: dict[str, Any] | Any = None - if "kv_cache_config" not in engine_args: - kv_cache_config = {} - kv_cache_config[ - "event_buffer_max_size" - ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE - else: - kv_cache_config = engine_args["kv_cache_config"] - if ( - hasattr(kv_cache_config, "event_buffer_max_size") - and not kv_cache_config.event_buffer_max_size - ): - kv_cache_config.event_buffer_max_size = ( - DEFAULT_KV_EVENT_BUFFER_MAX_SIZE - ) - elif ( - isinstance(kv_cache_config, dict) - and "event_buffer_max_size" not in kv_cache_config - ): - kv_cache_config[ - "event_buffer_max_size" - ] = DEFAULT_KV_EVENT_BUFFER_MAX_SIZE - engine_args["kv_cache_config"] = kv_cache_config - - # Enable iter perf stats by default if we are publishing events and metrics. 
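The removed `parse_endpoint` helper above accepts endpoints with or without the `dyn://` prefix and splits them into namespace, component, and endpoint. A self-contained illustration of that behavior, with the logic copied from the helper:

```python
# Illustration of parse_endpoint: "dyn://namespace.component.endpoint" and
# "namespace.component.endpoint" both resolve to the same three parts.
def parse_endpoint(endpoint: str) -> tuple[str, str, str]:
    endpoint_str = endpoint.replace("dyn://", "", 1)
    parts = endpoint_str.split(".")
    if len(parts) != 3:
        raise ValueError(f"Invalid endpoint format: '{endpoint}'")
    return (parts[0], parts[1], parts[2])


assert parse_endpoint("dyn://dynamo.TensorRTLLMPrefillWorker.generate") == (
    "dynamo",
    "TensorRTLLMPrefillWorker",
    "generate",
)
assert parse_endpoint("dynamo.TensorRTLLMWorker.generate") == (
    "dynamo",
    "TensorRTLLMWorker",
    "generate",
)
print("endpoint parsing behaves as expected")
```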
- if not engine_args.get("enable_iter_perf_stats"): - engine_args["enable_iter_perf_stats"] = True - - # Only pytorch backend is supported for now to publish events and metrics. - if engine_args.get("backend") != "pytorch": - logging.error( - "Only pytorch backend is supported for now to publish events and metrics." - ) - raise RuntimeError( - "Only pytorch backend is supported for now to publish events and metrics. Hence, KV router is not supported." - ) - - logging.info(f"TRTLLM engine args: {engine_args}") - - # Get the engine using the asynccontextmanager - self._llm_engine_context = get_tensorrtllm_engine(engine_args) - if self._llm_engine_context is not None: - self._llm_engine = await self._llm_engine_context.__aenter__() - else: - raise RuntimeError("Failed to create LLM engine context") - - if ( - self._config.publish_events_and_metrics - and self._config.disaggregation_mode != "prefill" - ): - kv_listener = runtime.namespace(self._config.namespace).component( - self._config.component - ) - self._llm_publisher_context = get_tensorrtllm_publisher( - kv_listener, - self._llm_engine, - kv_listener, - self._config.lease_id, - self._config.kv_block_size, - ) - if self._llm_publisher_context is not None: - self._llm_publisher = await self._llm_publisher_context.__aenter__() - else: - raise RuntimeError("Failed to create LLM publisher context") - - # Initialize prefill client if in decode mode - if self._config.disaggregation_mode == "decode": - if self._config.remote_prefill_endpoint is None: - raise ValueError("remote_prefill_endpoint is required for decode mode") - logging.info( - f"Initializing remote prefill client for endpoint: {self._config.remote_prefill_endpoint}" - ) - ( - parsed_namespace, - parsed_component_name, - parsed_endpoint_name, - ) = parse_endpoint(self._config.remote_prefill_endpoint) - if self._runtime is not None: - self._prefill_client = ( - await self._runtime.namespace(parsed_namespace) - .component(parsed_component_name) - .endpoint(parsed_endpoint_name) - .client() - ) - else: - raise RuntimeError("Runtime not initialized") - - async def cleanup(self): - """Cleanup resources""" - if self._llm_publisher_context: - try: - await self._llm_publisher_context.__aexit__(None, None, None) - except Exception as e: - logging.error(f"Error during publisher cleanup: {e}") - finally: - self._llm_publisher = None - self._llm_publisher_context = None - - if self._llm_engine_context: - try: - await self._llm_engine_context.__aexit__(None, None, None) - except Exception as e: - logging.error(f"Error during engine cleanup: {e}") - finally: - self._llm_engine = None - self._llm_engine_context = None - - self._prefill_client = None - - async def remote_prefill(self, request: TRTLLMWorkerRequest): - """ - Send a prefill request to the remote prefill worker. - - Args: - request: The original request to be sent for prefill - - Returns: - The response from the remote prefill worker - - Raises: - ValueError: If prefill client is not initialized or multiple responses received - """ - prefill_request = request.model_copy(deep=True) - # TRTLLM requires max_tokens to be set for prefill requests. - prefill_request.stop_conditions.max_tokens = 1 - prefill_request.disaggregated_params = OAIDisaggregatedParams( - request_type="context_only" - ) - - if self._prefill_client is None: - raise ValueError("Prefill client not initialized") - try: - # TODO: Use smart KV router to determine which prefill worker to use. This would also require supporting publishing events for prefill workers. 
- remote_prefill_responses = [ - remote_prefill_response - async for remote_prefill_response in await self._prefill_client.round_robin( - prefill_request.model_dump_json() - ) - ] - except Exception as e: - raise ValueError(f"Error in remote prefill: {e}") - - if len(remote_prefill_responses) > 1: - raise ValueError( - "Prefill worker returned more than one response. This is currently not supported in remote prefill mode." - ) - - if len(remote_prefill_responses) == 0: - raise ValueError("No response received from remote prefill worker") - - remote_prefill_response = remote_prefill_responses[0] - return remote_prefill_response - - async def generate(self, request: TRTLLMWorkerRequest): - if self._llm_engine is None: - raise RuntimeError("Engine not initialized") - - if self._llm_publisher: - publishers_error = self._llm_publisher.check_error_queue() - if publishers_error: - raise publishers_error - - inputs = request.token_ids - - # Decode the disaggregated params from the request - disaggregated_params = DisaggregatedTypeConverter.to_llm_disaggregated_params( - request.disaggregated_params - ) - num_output_tokens_so_far = 0 - - if self._config.disaggregation_mode == "decode": - # Run prefill/context phase remotely if disaggregation mode is decode. - try: - prefill_result = await self.remote_prefill(request) - except Exception as e: - raise ValueError(f"Error in remote prefill: {e}") - - remote_prefill_response = prefill_result.data() - if ( - remote_prefill_response["finish_reason"] == "stop" - or remote_prefill_response["finish_reason"] == "error" - ): - yield remote_prefill_response - return - num_output_tokens_so_far = len(remote_prefill_response["token_ids"]) - - # Decode the disaggregated params from the remote prefill response - # Decode the disaggregated params from the remote prefill response - disaggregated_params = ( - DisaggregatedTypeConverter.to_llm_disaggregated_params( - OAIDisaggregatedParams( - **remote_prefill_response["disaggregated_params"] - ) - ) - ) - - # Send the first token response to the client - first_token_response = remote_prefill_response - first_token_response.pop("disaggregated_params") - yield first_token_response - - # Set the disaggregated params to generation_only for the rest of the generation - disaggregated_params.request_type = "generation_only" - - sampling_params = self.default_sampling_params - for key, value in request.sampling_options.model_dump().items(): - if not value: - continue - if hasattr(sampling_params, key): - setattr(sampling_params, key, value) - - max_tokens = request.stop_conditions.max_tokens - if max_tokens: - sampling_params.max_tokens = max_tokens - - ignore_eos = request.stop_conditions.ignore_eos - if ignore_eos: - sampling_params.ignore_eos = ignore_eos - - # TODO: Disable streaming for context only requests when adding disagg support - async for res in self._llm_engine.llm.generate_async( - inputs=inputs, - sampling_params=sampling_params, - disaggregated_params=disaggregated_params, - streaming=(self._config.disaggregation_mode != "prefill"), - ): - # TRTLLM engine needs to start generating tokens first before stats - # can be retrieved. 
- if self._first_generation and self._llm_publisher: - self._llm_publisher.start() - self._first_generation = False - - if res.finished and self._config.disaggregation_mode != "prefill": - yield {"finish_reason": "stop", "token_ids": []} - break - - if not res.outputs: - yield {"finish_reason": "error", "token_ids": []} - break - - output = res.outputs[0] - next_total_toks = len(output.token_ids) - out = {"token_ids": output.token_ids[num_output_tokens_so_far:]} - if output.finish_reason: - out["finish_reason"] = output.finish_reason - if output.stop_reason: - out["stop_reason"] = output.stop_reason - if self._config.disaggregation_mode == "prefill": - # Return the disaggregated params only when operating in prefill mode. - out[ - "disaggregated_params" - ] = DisaggregatedTypeConverter.to_oai_disaggregated_params( - output.disaggregated_params - ).model_dump() - - yield out - num_output_tokens_so_far = next_total_toks diff --git a/examples/tensorrt_llm_sd/common/parser.py b/examples/tensorrt_llm_sd/common/parser.py deleted file mode 100644 index 67bb230796..0000000000 --- a/examples/tensorrt_llm_sd/common/parser.py +++ /dev/null @@ -1,62 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse - - -def parse_tensorrt_llm_args( - config_args, -) -> argparse.Namespace: - parser = argparse.ArgumentParser(description="A TensorRT-LLM Worker parser") - parser.add_argument( - "--extra-engine-args", - type=str, - default="", - help="Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine.", - ) - parser.add_argument( - "--model-path", - type=str, - default=None, - help="Path to disk model or HuggingFace model identifier to load.", - ) - parser.add_argument( - "--served_model_name", - type=str, - help="Name to serve the model under.", - ) - parser.add_argument( - "--router", - type=str, - choices=["random", "round-robin", "kv"], - default="random", - help="Router type to use for scheduling requests to workers", - ) - - parser.add_argument( - "--kv-block-size", - type=int, - default=32, - help="Number of tokens per KV block in TRTLLM worker. Default is 32 for pytorch backend.", - ) - - parser.add_argument( - "--enable-disagg", - action="store_true", - help="Enable remote prefill for the worker", - ) - - args = parser.parse_args(config_args) - return args diff --git a/examples/tensorrt_llm_sd/common/protocol.py b/examples/tensorrt_llm_sd/common/protocol.py deleted file mode 100644 index f05cdb9f8f..0000000000 --- a/examples/tensorrt_llm_sd/common/protocol.py +++ /dev/null @@ -1,104 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import base64 -from typing import List, Optional - -from pydantic import BaseModel, Field -from tensorrt_llm.llmapi import DisaggregatedParams as LlmDisaggregatedParams -from tensorrt_llm.serve.openai_protocol import DisaggregatedParams - - -class Tokens(BaseModel): - tokens: list[int] - - -TokenIdType = int - - -class DisaggregatedTypeConverter: - @staticmethod - def to_llm_disaggregated_params( - disaggregated_params: DisaggregatedParams, - ) -> LlmDisaggregatedParams: - if disaggregated_params is None: - return None - else: - opaque_state = ( - base64.b64decode(disaggregated_params.encoded_opaque_state) - if disaggregated_params.encoded_opaque_state is not None - else None - ) - - return LlmDisaggregatedParams( - request_type=disaggregated_params.request_type, - first_gen_tokens=disaggregated_params.first_gen_tokens, - ctx_request_id=disaggregated_params.ctx_request_id, - opaque_state=opaque_state, - ) - - @staticmethod - def to_oai_disaggregated_params( - tllm_disagg_params: LlmDisaggregatedParams, - ) -> DisaggregatedParams: - if tllm_disagg_params is None: - return None - else: - encoded_opaque_state = ( - base64.b64encode(tllm_disagg_params.opaque_state).decode("utf-8") - if tllm_disagg_params.opaque_state is not None - else None - ) - return DisaggregatedParams( - request_type=tllm_disagg_params.request_type, - first_gen_tokens=tllm_disagg_params.first_gen_tokens, - ctx_request_id=tllm_disagg_params.ctx_request_id, - encoded_opaque_state=encoded_opaque_state, - ) - - -# TODO: move these to common for all LLMs once we adopt dynamo-run -# derived from lib/llm/src/protocols/common/preprocessor.rs -class StopConditions(BaseModel): - max_tokens: Optional[int] = None - stop: Optional[List[str]] = None - stop_token_ids_hidden: Optional[List[TokenIdType]] = None - min_tokens: Optional[int] = None - ignore_eos: Optional[bool] = None - - -class SamplingOptions(BaseModel): - n: Optional[int] = None - best_of: Optional[int] = None - presence_penalty: Optional[float] = None - frequency_penalty: Optional[float] = None - repetition_penalty: Optional[float] = None - temperature: Optional[float] = None - top_p: Optional[float] = None - top_k: Optional[int] = None - min_p: Optional[float] = None - use_beam_search: Optional[bool] = None - length_penalty: Optional[float] = None - seed: Optional[int] = None - - -class TRTLLMWorkerRequest(BaseModel): - token_ids: List[TokenIdType] - stop_conditions: StopConditions - sampling_options: SamplingOptions - eos_token_ids: List[TokenIdType] = Field(default_factory=list) - mdc_sum: Optional[str] = None - annotations: List[str] = Field(default_factory=list) - estimated_prefix_hit_num_blocks: Optional[int] = None - disaggregated_params: Optional[DisaggregatedParams] = Field(default=None) diff --git a/examples/tensorrt_llm_sd/components/frontend.py b/examples/tensorrt_llm_sd/components/frontend.py deleted file mode 100644 index 98be2dfa33..0000000000 --- a/examples/tensorrt_llm_sd/components/frontend.py +++ /dev/null @@ -1,119 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import logging -import subprocess -from pathlib import Path - -from components.worker import TensorRTLLMWorker -from fastapi import FastAPI -from pydantic import BaseModel - -from dynamo import sdk -from dynamo.sdk import depends, service -from dynamo.sdk.lib.config import ServiceConfig -from dynamo.sdk.lib.image import DYNAMO_IMAGE - -logger = logging.getLogger(__name__) - - -def get_dynamo_run_binary(): - """Find the dynamo-run binary path in SDK or fallback to 'dynamo-run' command.""" - sdk_path = Path(sdk.__file__) - binary_path = sdk_path.parent / "cli/bin/dynamo-run" - if not binary_path.exists(): - return "dynamo-run" - else: - return str(binary_path) - - -class FrontendConfig(BaseModel): - """Configuration for the Frontend service including model and HTTP server settings.""" - - served_model_name: str - endpoint: str - port: int = 8000 - router: str = "round-robin" - block_size: int = 32 - - -# todo this should be called ApiServer -@service( - dynamo={ - "namespace": "dynamo", - }, - workers=1, - image=DYNAMO_IMAGE, - app=FastAPI(title="TensorRT-LLM Example"), -) -class Frontend: - worker = depends(TensorRTLLMWorker) - - def __init__(self): - """Initialize Frontend service with HTTP server and model configuration.""" - self.frontend_config = FrontendConfig( - **ServiceConfig.get_parsed_config("Frontend") - ) - self.process = None - - logger.warning(f"Frontend config: {self.frontend_config}") - - self.start_ingress_and_processor() - - def start_ingress_and_processor(self): - """Starting dynamo-run based ingress and processor""" - logger.info( - f"Starting HTTP server and processor on port {self.frontend_config.port}" - ) - dynamo_run_binary = get_dynamo_run_binary() - - cmd = [ - dynamo_run_binary, - "in=http", - "out=dyn", - "--http-port", - str(self.frontend_config.port), - "--router-mode", - self.frontend_config.router, - ] - - logger.info(f"Frontend cmd: {cmd}") - - self.process = subprocess.Popen( - cmd, - stdout=None, - stderr=None, - ) - - def close(self): - """Clean up resources by terminating the subprocess.""" - if self.process is not None: - try: - logger.info("Terminating subprocess...") - self.process.terminate() - # Wait for process to terminate with a timeout - self.process.wait(timeout=5) - except subprocess.TimeoutExpired: - logger.warning("Subprocess did not terminate gracefully, forcing kill") - self.process.kill() - self.process.wait() - except Exception as e: - logger.error(f"Error while terminating subprocess: {e}") - finally: - self.process = None - - def __del__(self): - """Destructor to ensure subprocess is cleaned up.""" - self.close() diff --git a/examples/tensorrt_llm_sd/components/prefill_worker.py b/examples/tensorrt_llm_sd/components/prefill_worker.py deleted file mode 100644 index 7e43d1fca7..0000000000 --- a/examples/tensorrt_llm_sd/components/prefill_worker.py +++ /dev/null @@ -1,75 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & 
AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import logging - -from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine -from common.parser import parse_tensorrt_llm_args -from common.protocol import TRTLLMWorkerRequest - -from dynamo.sdk import async_on_start, dynamo_context, endpoint, on_shutdown, service -from dynamo.sdk.lib.config import ServiceConfig - -logger = logging.getLogger(__name__) - - -@service( - dynamo={ - "namespace": "dynamo", - }, - resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, - workers=1, -) -class TensorRTLLMPrefillWorker(BaseTensorrtLLMEngine): - def __init__(self): - logger.info("Initializing TensorRT-LLM Prefill Worker") - class_name = self.__class__.__name__ - config = ServiceConfig.get_instance() - config_args = config.as_args(class_name, prefix="") - args = parse_tensorrt_llm_args(config_args) - lease_id = dynamo_context["endpoints"][0].lease_id() - namespace, _ = TensorRTLLMPrefillWorker.dynamo_address() # type: ignore - - engine_config = BaseEngineConfig( - namespace=namespace, - component=class_name, - endpoint="generate", - model_path=args.model_path, - served_model_name=args.served_model_name, - kv_block_size=args.kv_block_size, - extra_engine_args=args.extra_engine_args, - publish_events_and_metrics=False, - disaggregation_mode="prefill", - remote_prefill_endpoint=None, - lease_id=lease_id, - ) - - super().__init__(config=engine_config) - - @async_on_start - async def async_init(self): - runtime = dynamo_context["runtime"] - await self.initialize(runtime) - logger.info("TensorRT-LLM Prefill Worker initialized") - - @on_shutdown - async def async_cleanup(self): - logger.info("Cleaning up TensorRT-LLM Prefill Worker") - await self.cleanup() - logger.info("TensorRT-LLM Prefill Worker cleanup completed") - - @endpoint() - async def generate(self, request: TRTLLMWorkerRequest): - async for response in super().generate(request): - yield response diff --git a/examples/tensorrt_llm_sd/components/worker.py b/examples/tensorrt_llm_sd/components/worker.py deleted file mode 100644 index 9074bfbe8d..0000000000 --- a/examples/tensorrt_llm_sd/components/worker.py +++ /dev/null @@ -1,115 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import logging - -from common.base_engine import BaseEngineConfig, BaseTensorrtLLMEngine -from common.parser import parse_tensorrt_llm_args -from common.protocol import TRTLLMWorkerRequest -from components.prefill_worker import TensorRTLLMPrefillWorker - -from dynamo.llm import ModelType, register_llm -from dynamo.sdk import ( - async_on_start, - depends, - dynamo_context, - endpoint, - on_shutdown, - service, -) -from dynamo.sdk.lib.config import ServiceConfig - -logger = logging.getLogger(__name__) - - -@service( - dynamo={ - "namespace": "dynamo", - }, - resources={"gpu": 1, "cpu": "10", "memory": "20Gi"}, - workers=1, -) -class TensorRTLLMWorker(BaseTensorrtLLMEngine): - prefill_worker = depends(TensorRTLLMPrefillWorker) - - def __init__(self): - logger.info("Initializing TensorRT-LLM Worker") - class_name = self.__class__.__name__ - config = ServiceConfig.get_instance() - config_args = config.as_args(class_name, prefix="") - args = parse_tensorrt_llm_args(config_args) - lease_id = dynamo_context["endpoints"][0].lease_id() - namespace, _ = TensorRTLLMWorker.dynamo_address() # type: ignore - endpoint_name = "generate" - publish_events_and_metrics = args.router == "kv" - prefill_class_name = "TensorRTLLMPrefillWorker" - - if args.enable_disagg: - disaggregation_mode = "decode" - else: - disaggregation_mode = "prefill_and_decode" - - engine_config = BaseEngineConfig( - namespace=namespace, - component=class_name, - endpoint=endpoint_name, - model_path=args.model_path, - served_model_name=args.served_model_name, - kv_block_size=args.kv_block_size, - extra_engine_args=args.extra_engine_args, - publish_events_and_metrics=publish_events_and_metrics, - disaggregation_mode=disaggregation_mode, - remote_prefill_endpoint=f"dyn://{namespace}.{prefill_class_name}.generate", - lease_id=lease_id, - ) - - super().__init__(config=engine_config) - - @async_on_start - async def async_init(self): - runtime = dynamo_context["runtime"] - await self.initialize(runtime) - - logger.info("Registering LLM for discovery") - endpoint = ( - runtime.namespace(self._config.namespace) - .component(self._config.component) - .endpoint(self._config.endpoint) - ) - - try: - await register_llm( - ModelType.Backend, - endpoint, - self._config.model_path, - self._config.served_model_name, - kv_cache_block_size=self._config.kv_block_size, - ) - logger.info("Successfully registered LLM for discovery") - except Exception as e: - logger.error(f"Failed to register LLM for discovery: {e}") - raise - - logger.info("TensorRT-LLM Worker initialized") - - @on_shutdown - async def async_cleanup(self): - logger.info("Cleaning up TensorRT-LLM Worker") - await self.cleanup() - logger.info("TensorRT-LLM Worker cleanup completed") - - @endpoint() - async def generate(self, request: TRTLLMWorkerRequest): - async for response in super().generate(request): - yield response diff --git a/examples/tensorrt_llm_sd/configs/agg.yaml b/examples/tensorrt_llm_sd/configs/agg.yaml deleted file mode 100644 index a3d4594ed8..0000000000 --- a/examples/tensorrt_llm_sd/configs/agg.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/agg_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/agg_router.yaml b/examples/tensorrt_llm_sd/configs/agg_router.yaml deleted file mode 100644 index 58f2a82ab3..0000000000 --- a/examples/tensorrt_llm_sd/configs/agg_router.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: kv - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/agg_config.yaml" - router: kv - ServiceArgs: - workers: 1 - resources: - gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/disagg.yaml b/examples/tensorrt_llm_sd/configs/disagg.yaml deleted file mode 100644 index 73c202f4be..0000000000 --- a/examples/tensorrt_llm_sd/configs/disagg.yaml +++ /dev/null @@ -1,48 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - # Name to serve the model under - served_model_name: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/decode_config.yaml" - enable-disagg: true - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 - -TensorRTLLMPrefillWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 - diff --git a/examples/tensorrt_llm_sd/configs/disagg_router.yaml b/examples/tensorrt_llm_sd/configs/disagg_router.yaml deleted file mode 100644 index faae7f65a3..0000000000 --- a/examples/tensorrt_llm_sd/configs/disagg_router.yaml +++ /dev/null @@ -1,47 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: kv - -TensorRTLLMWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Name to serve the model under - served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/decode_config.yaml" - enable-disagg: true - router: kv - ServiceArgs: - workers: 1 - resources: - gpu: 1 - -TensorRTLLMPrefillWorker: - # Path to disk model or HuggingFace model identifier to load - model-path: deepseek-ai/DeepSeek-R1-Distill-Llama-8B - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. 
- # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 1 \ No newline at end of file diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml deleted file mode 100644 index 02b5cd8463..0000000000 --- a/examples/tensorrt_llm_sd/configs/engine_configs/agg_config.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true - -kv_cache_config: - free_gpu_memory_fraction: 0.95 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -use_cuda_graph: true diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml deleted file mode 100644 index eb943fd6e7..0000000000 --- a/examples/tensorrt_llm_sd/configs/engine_configs/decode_config.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -disable_overlap_scheduler: false -use_cuda_graph: true -kv_cache_config: - free_gpu_memory_fraction: 0.95 - diff --git a/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml deleted file mode 100644 index 5dee9e653d..0000000000 --- a/examples/tensorrt_llm_sd/configs/engine_configs/prefill_config.yaml +++ /dev/null @@ -1,28 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -tensor_parallel_size: 1 -moe_expert_parallel_size: 1 -enable_attention_dp: false -max_num_tokens: 8192 -max_batch_size: 16 -trust_remote_code: true -backend: pytorch -enable_chunked_prefill: true -# Overlap scheduler not currently supported in prefill only workers. -disable_overlap_scheduler: true -use_cuda_graph: false - -kv_cache_config: - free_gpu_memory_fraction: 0.95 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml deleted file mode 100644 index 3ac3facedd..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/agg.yaml +++ /dev/null @@ -1,31 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - extra-engine-args: "configs/llama4_eagle3/engine_configs/agg_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml deleted file mode 100644 index 9d96befbe5..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/disagg.yaml +++ /dev/null @@ -1,49 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/DeepSeek-R1-FP4" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - extra-engine-args: "configs/deepseek_r1/engine_configs/decode_config.yaml" - enable-disagg: true - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 - -TensorRTLLMPrefillWorker: - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - extra-engine-args: "configs/deepseek_r1/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml deleted file mode 100644 index 29dddba56f..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/agg_config.yaml +++ /dev/null @@ -1,54 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. 
- # free_gpu_memory_fraction: 0.30 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -use_cuda_graph: true -cuda_graph_padding_enabled: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml deleted file mode 100644 index 772b94b283..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/decode_config.yaml +++ /dev/null @@ -1,55 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: false - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. - # free_gpu_memory_fraction: 0.30 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false -use_cuda_graph: true -cuda_graph_padding_enabled: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml deleted file mode 100644 index 6ae899a68a..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/engine_configs/prefill_config.yaml +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# TP/EP/PP/DP -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.75 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true -# NOTE: This dtype must match in both prefill/decode configs -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml deleted file mode 100644 index f0b5411221..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml +++ /dev/null @@ -1,50 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.30 - -# Enable the MTP(Multi-Token Prediction) in the model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml deleted file mode 100644 index ab48b2e78b..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/decode_config.yaml +++ /dev/null @@ -1,53 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: false -max_batch_size: 256 -# Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: -# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) -# This is a known issue in TensorRT-LLM and will be resolved in the next release. -max_num_tokens: 512 -# 8704 = 8192 ISL + 512 OSL -max_seq_len: 8704 -kv_cache_config: - free_gpu_memory_fraction: 0.85 - -# Enable the MTP(Multi-Token Prediction) in decode model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 - -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml deleted file mode 100644 index ee6ee26a94..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/engine_configs/prefill_config.yaml +++ /dev/null @@ -1,37 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. 
- -backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 4 -enable_attention_dp: true -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 -kv_cache_config: - free_gpu_memory_fraction: 0.75 -print_iter_log: true -kv_cache_dtype: fp8 -disable_overlap_scheduler: true - -# Enable the MTP(Multi-Token Prediction) in the prefill model engine -speculative_config: - decoding_type: MTP - num_nextn_predict_layers: 1 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml deleted file mode 100644 index 626ca27953..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_agg.yaml +++ /dev/null @@ -1,32 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/llama4_eagle3/mtp/engine_configs/agg_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml deleted file mode 100644 index 5fe2679809..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/mtp/mtp_disagg.yaml +++ /dev/null @@ -1,52 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -Frontend: - served_model_name: "nvidia/DeepSeek-R1-FP4" - endpoint: dynamo.TensorRTLLMWorker.generate - port: 8000 - router: round-robin - -TensorRTLLMWorker: - served_model_name: "nvidia/DeepSeek-R1-FP4" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. 
- model-path: "nvidia/DeepSeek-R1-FP4" - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/decode_config.yaml" - router: round-robin - enable-disagg: true - ServiceArgs: - workers: 1 - resources: - gpu: 4 - -TensorRTLLMPrefillWorker: - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/DeepSeek-R1-FP4" - # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. - # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. - extra-engine-args: "configs/deepseek_r1/mtp/engine_configs/prefill_config.yaml" - router: round-robin - ServiceArgs: - workers: 1 - resources: - gpu: 4 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md deleted file mode 100644 index 342cd45129..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/README.md +++ /dev/null @@ -1,275 +0,0 @@ - - -# Example: Multi-node TRTLLM Workers with Dynamo on Slurm - -To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16), -the set of nodes need to be launched together in the same MPI world, such as -via `mpirun` or `srun`. This is true regardless of whether the worker is -aggregated, prefill-only, or decode-only. - -In this document we will demonstrate two examples launching multinode workers -on a slurm cluster with `srun`: -1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16 - worker across 4 GB200 nodes -2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node - TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode - worker (4 nodes) across a total of 8 GB200 nodes. - -NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and -`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or -using `mpirun` directly, with relative ease. - -## Setup - -For simplicity of the example, we will make some assumptions about your slurm cluster: -1. First, we assume you have access to a slurm cluster with multiple GPU nodes - available. For functional testing, most setups should be fine. For performance - testing, you should aim to allocate groups of nodes that are performantly - inter-connected, such as those in an NVL72 setup. -2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis) - SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this - example will use `srun` arguments like `--container-image`, - `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis. - If your cluster supports similar container based plugins, you may be able to - modify the script to use that instead. -3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as - described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker). - This is the image that can be set to the `IMAGE` environment variable in later steps. -4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. 
We - will allocate 8 nodes below as a reference command to have enough capacity - to run both examples. If you plan to only run the aggregated example, you - will only need 4 nodes. If you customize the configurations to require a - different number of nodes, you can adjust the number of allocated nodes - accordingly. Pre-allocating nodes is technically not a requirement, - but it makes iterations of testing/experimenting easier. - - Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup: - ```bash - # Set partition manually based on your slurm cluster's partition names - PARTITION="" - # Set account manually if this command doesn't work on your cluster - ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" - salloc \ - --partition="${PARTITION}" \ - --account="${ACCOUNT}" \ - --job-name="${ACCOUNT}-dynamo.trtllm" \ - -t 05:00:00 \ - --nodes 8 - ``` -5. Lastly, we will assume you are inside an interactive shell on one of your allocated - nodes, which may be the default behavior after executing the `salloc` command above - depending on the cluster setup. If not, then you should SSH into one of the allocated nodes. - -### Environment Variable Setup - -This example aims to automate as much of the environment setup as possible, -but all slurm clusters and environments are different, and you may need to -dive into the scripts to make modifications based on your specific environment. - -Assuming you have already allocated your nodes via `salloc`, and are -inside an interactive shell on one of the allocated nodes, set the -following environment variables based on your environment: -```bash -# NOTE: IMAGE must be set manually for now -# To build an image, see the steps here: -# https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker -export IMAGE="" - -# MOUNTS are the host:container path pairs that are mounted into the containers -# launched by each `srun` command. -# -# If you want to reference files, such as $MODEL_PATH below, in a -# different location, you can customize MOUNTS or specify additional -# comma-separated mount pairs here. -# -# NOTE: Currently, this example assumes that the local bash scripts and configs -# referenced are mounted into /mnt inside the container. If you want to -# customize the location of the scripts, make sure to modify `srun_aggregated.sh` -# accordingly for the new locations of `start_frontend_services.sh` and -# `start_trtllm_worker.sh`. -# -# For example, assuming your cluster had a `/lustre` directory on the host, you -# could add that as a mount like so: -# -# export MOUNTS="${PWD}:/mnt,/lustre:/lustre" -export MOUNTS="${PWD}:/mnt" - -# NOTE: In general, DeepSeek R1 is very large, so it is recommended to -# pre-download the model weights and save them in some shared location, -# NFS storage, HF_CACHE, etc. and modify the `--model-path` below -# to reuse the pre-downloaded weights instead. -# -# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights: -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# -# On Hopper systems, FP4 isn't supported so you'll need to use the default weights: -# https://huggingface.co/deepseek-ai/DeepSeek-R1 -export MODEL_PATH="nvidia/DeepSeek-R1-FP4" - -# The name the model will be served/queried under, matching what's -# returned by the /v1/models endpoint. -# -# By default this is inferred from MODEL_PATH, but when using locally downloaded -# model weights, it can be nice to have explicit control over the name.
-export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4" -``` - -## Aggregated WideEP - -Assuming you have at least 4 nodes allocated following the setup steps above, -follow these steps below to launch an **aggregated** deployment across 4 nodes: - -```bash -# Default set in srun_aggregated.sh, but can customize here. -# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml" - -# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG -# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of -# total GPUs necessary to satisfy the requested parallelism. For example, -# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16. -# export NUM_NODES=4 - -# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. -# export NUM_GPUS_PER_NODE=4 - -# Launches: -# - frontend + etcd/nats on current (head) node -# - one large aggregated trtllm worker across multiple nodes via MPI tasks -./srun_aggregated.sh -``` - -## Disaggregated WideEP - -Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) -following the setup above, follow these steps below to launch a **disaggregated** -deployment across 8 nodes: - -> [!Tip] -> Make sure you have a fresh environment and don't still have the aggregated -> example above still deployed on the same set of nodes. - -```bash -# Defaults set in srun_disaggregated.sh, but can customize here. -# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml" -# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml" - -# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG -# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG -# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and -# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of -# GPUs necessary to satisfy the requested parallelism in each config. -# export NUM_PREFILL_NODES=4 -# export NUM_DECODE_NODES=4 - -# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this. -# export NUM_GPUS_PER_NODE=4 - -# Launches: -# - frontend + etcd/nats on current (head) node. -# - one large prefill trtllm worker across multiple nodes via MPI tasks -# - one large decode trtllm worker across multiple nodes via MPI tasks -./srun_disaggregated.sh -``` - -## Understanding the Output - -1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches - etcd, NATS, and the OpenAI frontend on the head node only - called "node1" in the example output below. The second launches - a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node - using 4 GPUs each. - ``` - # Frontend/etcd/nats services - srun: launching StepId=453374.17 on host node1, 1 tasks: 0 - ... - # TP16 TRTLLM worker split across 4 nodes with 4 gpus each - srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3] - srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7] - srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11] - srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15] - ``` -2. The OpenAI frontend will listen for and dynamically discover workers as - they register themselves with Dynamo's distributed runtime: - ``` - 0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models - 0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000" - ``` -3. 
-3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
-   GPU on each node, which will each output their progress while loading the model.
-   You can see each rank's output prefixed with the rank at the start of each log line
-   until the model successfully finishes loading:
-   ```
-   8: rank8 run mgmn worker node with mpi_world_size: 16 ...
-   10: rank10 run mgmn worker node with mpi_world_size: 16 ...
-   9: rank9 run mgmn worker node with mpi_world_size: 16 ...
-   11: rank11 run mgmn worker node with mpi_world_size: 16 ...
-   ...
-   15: Model init total -- 55.42s
-   11: Model init total -- 55.91s
-   12: Model init total -- 55.24s
-   ```
-4. After the model fully finishes loading on all ranks, the worker will register itself,
-   and the OpenAI frontend will detect it, signaled by this output:
-   ```
-   0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4"
-   ```
-5. At this point, with the worker fully initialized and detected by the frontend,
-   it is now ready for inference.
-6. `srun_disaggregated.sh` follows a very similar flow, but launches three srun
-   jobs instead of two: one for the frontend, one for the prefill worker, and one
-   for the decode worker.
-
-## Example Request
-
-To verify the deployed model is working, send a `curl` request:
-```bash
-# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
-HOST=localhost
-PORT=8000
-# "model" here should match the model name returned by the /v1/models endpoint
-curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "'${SERVED_MODEL_NAME}'",
-    "messages": [
-    {
-        "role": "user",
-        "content": "Tell me a story as if we were playing dungeons and dragons."
-    }
-    ],
-    "stream": true,
-    "max_tokens": 30
-}'
-```
-
-## Cleanup
-
-To clean up background `srun` processes launched by `srun_aggregated.sh` or
-`srun_disaggregated.sh`, you can run:
-```bash
-pkill srun
-```
-
-## Known Issues
-
-- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
-  FP4 weights. In theory, the example should work on alternative setups such as
-  H100 nodes with FP8 weights, but this hasn't been tested yet.
-- This example only tests an aggregated model setup for now. A disaggregated
-  serving example will be added in the near future.
-- WideEP configs in this directory are still being tested. A WideEP specific
-  example with documentation will be added once ready.
-- There are known issues where WideEP workers may not cleanly shut down:
-  - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
-    now, you must manually clean these up before deploying again on the
-    same set of nodes.
-  - Similarly, there may be GPU memory left in-use after killing the `srun`
-    jobs. After cleaning up any leftover shared memory files as described
-    above, the GPU memory should slowly be released. You can run `watch nvidia-smi`
-    to check on this behavior. If you don't free the GPU memory before the
-    next deployment, you may get a CUDA OOM error while loading the model.
-  - There is mention of this issue in the relevant TRT-LLM blog
-    [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
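As a convenience, the manual cleanup steps called out in the known issues above can be strung together into a small sketch. This is not one of the provided scripts; it only combines the `pkill srun`, `/dev/shm/moe_*`, and `nvidia-smi` checks already described, so treat it as a starting point and adapt it to your cluster.

```bash
#!/bin/bash
# Cleanup sketch before re-deploying on the same set of nodes.
# Assumes the WideEP shared-memory file pattern (/dev/shm/moe_*) noted above.

# Stop any background srun jobs launched by srun_aggregated.sh / srun_disaggregated.sh.
pkill srun || true

# Remove leftover WideEP shared-memory files, if any remain after shutdown.
rm -f /dev/shm/moe_*

# Watch GPU memory being released before the next deployment (Ctrl+C to exit).
watch nvidia-smi
```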
diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml deleted file mode 100644 index d697caacfa..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/dep16_agg.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Example of a Multi-node worker, but no WideEP or EPLB. -# See wide_ep*.yaml for WideEP example configs. -backend: pytorch -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.7 -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml deleted file mode 100644 index f2fe0a13c6..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/eplb.yaml +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# moe_load_balancer settings for TRTLLM based on: -# https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ep_load_balancer/README.md#online-ep-load-balancer -num_slots: 288 -layer_updates_per_iter: 2 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml deleted file mode 100644 index 5bbc66bd69..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_agg.yaml +++ /dev/null @@ -1,35 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -backend: pytorch - -# WideEP related settings -moe_backend: WideEP -# moe_max_num_tokens will default to max_num_tokens if left unspecified. -# -# If you want to set this value explicitly, one recommendation is below: -# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size -# 4096 = 256 * 16 -# moe_max_num_tokens: 4096 -moe_load_balancer: /mnt/engine_configs/eplb.yaml -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 - -enable_attention_dp: true -max_batch_size: 256 -max_num_tokens: 256 -max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.7 -use_cuda_graph: true -cuda_graph_padding_enabled: true -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml deleted file mode 100644 index ac7fc7e8f6..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_decode.yaml +++ /dev/null @@ -1,59 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -backend: pytorch - -# WideEP related settings -moe_backend: WideEP -moe_load_balancer: /mnt/engine_configs/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 256 -max_num_tokens: 256 -# 8448 = 8192 ISL + 256 OSL -max_seq_len: 8448 - -kv_cache_config: - # With dp attention disabled: high free_gpu_memory_fraction is fine. - # free_gpu_memory_fraction: 0.85 - # With dp attention enabled: large ISL at high concurrency may need - # free_gpu_memory_fraction low to have enough available memory. - free_gpu_memory_fraction: 0.30 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: false -use_cuda_graph: true -cuda_graph_padding_enabled: true -# NOTE: For larger max batch size, you may want to add larger cuda graph -# batch sizes below to match. -cuda_graph_batch_sizes: -- 1 -- 2 -- 4 -- 8 -- 16 -- 32 -- 64 -- 128 -- 256 -print_iter_log: true -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml deleted file mode 100644 index 06968a3a76..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/engine_configs/wide_ep_prefill.yaml +++ /dev/null @@ -1,41 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-backend: pytorch - -# WideEP related settings -moe_backend: WideEP -moe_load_balancer: /mnt/engine_configs/eplb.yaml - -# TP/EP/PP/DP -tensor_parallel_size: 16 -moe_expert_parallel_size: 16 -pipeline_parallel_size: 1 -enable_attention_dp: true - -max_batch_size: 1 -max_num_tokens: 8192 -max_seq_len: 8192 - -kv_cache_config: - free_gpu_memory_fraction: 0.75 - -# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603 -# NOTE: overlap_scheduler enabled by default since this commit and changed -# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler': -# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428 -disable_overlap_scheduler: true -print_iter_log: true -# NOTE: This dtype must match in both prefill/decode configs -kv_cache_dtype: fp8 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh deleted file mode 100755 index 5a632551b9..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_aggregated.sh +++ /dev/null @@ -1,75 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# This is one of the only variables that must be set currently, most of the rest may -# just work out of the box if following the steps in the README. -IMAGE="${IMAGE:-""}" - -# Set to mount current host directory to /mnt inside the container as an example, -# but you may freely customize the mounts based on your cluster. A common practice -# is to mount paths to NFS storage for common scripts, model weights, etc. -# NOTE: This can be a comma separated list of multiple mounts as well. -DEFAULT_MOUNT="${PWD}:/mnt" -MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" - -# Example values, assuming 4 nodes with 4 GPUs on each node, such as 4xGB200 nodes. -# For 8xH100 nodes as an example, you may set this to 2 nodes x 8 gpus/node instead. -NUM_NODES=${NUM_NODES:-4} -NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} - -export ENGINE_CONFIG="${ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_agg.yaml}" - -# Automate settings of certain variables for convenience, but you are free -# to manually set these for more control as well. -ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" -export HEAD_NODE="${SLURMD_NODENAME}" -export HEAD_NODE_IP="$(hostname -i)" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" - -if [[ -z ${IMAGE} ]]; then - echo "ERROR: You need to set the IMAGE environment variable to the " \ - "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ - "See how to build one from source here: " \ - "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" - exit 1 -fi - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching frontend services in background." 
-srun \ - --overlap \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodelist "${HEAD_NODE}" \ - --nodes 1 \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_frontend_services.sh & - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching multi-node worker in background." -# No --task for the worker defaults to aggregated mode -TASK="" \ -srun \ - --mpi pmix \ - --oversubscribe \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodes "${NUM_NODES}" \ - --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh deleted file mode 100755 index 32cb4993a9..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/srun_disaggregated.sh +++ /dev/null @@ -1,94 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# This is one of the only variables that must be set currently, most of the rest may -# just work out of the box if following the steps in the README. -IMAGE="${IMAGE:-""}" - -# Set to mount current host directory to /mnt inside the container as an example, -# but you may freely customize the mounts based on your cluster. A common practice -# is to mount paths to NFS storage for common scripts, model weights, etc. -# NOTE: This can be a comma separated list of multiple mounts as well. -DEFAULT_MOUNT="${PWD}:/mnt" -MOUNTS="${MOUNTS:-${DEFAULT_MOUNT}}" - -NUM_GPUS_PER_NODE=${NUM_GPUS_PER_NODE:-4} - -NUM_PREFILL_NODES=${NUM_PREFILL_NODES:-4} -PREFILL_ENGINE_CONFIG="${PREFILL_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_prefill.yaml}" - -NUM_DECODE_NODES=${NUM_DECODE_NODES:-4} -DECODE_ENGINE_CONFIG="${DECODE_ENGINE_CONFIG:-/mnt/engine_configs/wide_ep_decode.yaml}" - -# Automate settings of certain variables for convenience, but you are free -# to manually set these for more control as well. -ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)" -export HEAD_NODE="${SLURMD_NODENAME}" -export HEAD_NODE_IP="$(hostname -i)" -export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" -export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" - -if [[ -z ${IMAGE} ]]; then - echo "ERROR: You need to set the IMAGE environment variable to the " \ - "Dynamo+TRTLLM docker image or .sqsh file from 'enroot import' " \ - "See how to build one from source here: " \ - "https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker" - exit 1 -fi - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching frontend services in background." 
-srun \ - --overlap \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodelist "${HEAD_NODE}" \ - --nodes 1 \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_frontend_services.sh & - -# NOTE: Output streamed to stdout for ease of understanding the example, but -# in practice you would probably set `srun --output ... --error ...` to pipe -# the stdout/stderr to files. -echo "Launching multi-node prefill worker in background." -TASK=prefill \ -ENGINE_CONFIG=${PREFILL_ENGINE_CONFIG} \ -srun \ - --mpi pmix \ - --oversubscribe \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodes "${NUM_PREFILL_NODES}" \ - --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_trtllm_worker.sh & - -echo "Launching multi-node decode worker in background." -TASK=decode \ -ENGINE_CONFIG=${DECODE_ENGINE_CONFIG} \ -srun \ - --mpi pmix \ - --oversubscribe \ - --container-image "${IMAGE}" \ - --container-mounts "${MOUNTS}" \ - --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE,TASK,ENGINE_CONFIG \ - --verbose \ - --label \ - -A "${ACCOUNT}" \ - -J "${ACCOUNT}-dynamo.trtllm" \ - --nodes "${NUM_DECODE_NODES}" \ - --ntasks-per-node "${NUM_GPUS_PER_NODE}" \ - --jobid "${SLURM_JOB_ID}" \ - /mnt/start_trtllm_worker.sh & diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh deleted file mode 100755 index 0d1b588904..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_frontend_services.sh +++ /dev/null @@ -1,16 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# Start NATS -nats-server -js & - -# Start etcd -etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & - -# Wait for NATS/etcd to startup -sleep 3 - -# Start OpenAI Frontend which will dynamically discover workers when they startup -# NOTE: This is a blocking call. -dynamo-run in=http out=dyn --http-port 8000 diff --git a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh b/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh deleted file mode 100755 index 257b3b1127..0000000000 --- a/examples/tensorrt_llm_sd/configs/llama4_eagle3/multinode/start_trtllm_worker.sh +++ /dev/null @@ -1,46 +0,0 @@ -#!/bin/bash -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -if [[ -z ${MODEL_PATH} ]]; then - echo "ERROR: MODEL_PATH was not set." - echo "ERROR: MODEL_PATH must be set to either the HuggingFace ID or locally " \ - "downloaded path to the model weights. Since Deepseek R1 is large, it is " \ - "recommended to pre-download them to a shared location and provide the path." - exit 1 -fi - -if [[ -z ${SERVED_MODEL_NAME} ]]; then - echo "WARNING: SERVED_MODEL_NAME was not set. It will be derived from MODEL_PATH." -fi - - - -if [[ -z ${ENGINE_CONFIG} ]]; then - echo "ERROR: ENGINE_CONFIG was not set." 
- echo "ERROR: ENGINE_CONFIG must be set to a valid Dynamo+TRTLLM engine config file." - exit 1 -fi - -EXTRA_ARGS="" -if [[ -n ${TASK} ]]; then - EXTRA_ARGS+="--task ${TASK}" -fi - -# NOTE: When this script is run directly from srun, the environment variables -# for TRTLLM KV cache are not set. So we need to set them here. -# Related issue: https://github.com/ai-dynamo/dynamo/issues/1743 -if [[ -z ${TRTLLM_USE_UCX_KVCACHE} ]] && [[ -z ${TRTLLM_USE_NIXL_KVCACHE} ]]; then - export TRTLLM_USE_UCX_KVCACHE=1 -fi - -# NOTE: trtllm_inc.py is a standalone python script that launches a Dynamo+TRTLLM -# worker and registers itself with the runtime. It is currently easier to wrap -# this standalone script with `trtllm-llmapi-launch` for MPI handling purposes, -# but this may be refactored into 'dynamo serve' in the future. -trtllm-llmapi-launch \ - python3 /workspace/launch/dynamo-run/src/subprocess/trtllm_inc.py \ - --model-path "${MODEL_PATH}" \ - --model-name "${SERVED_MODEL_NAME}" \ - --extra-engine-args "${ENGINE_CONFIG}" \ - ${EXTRA_ARGS} diff --git a/examples/tensorrt_llm_sd/graphs/agg.py b/examples/tensorrt_llm_sd/graphs/agg.py deleted file mode 100644 index e79f5f315c..0000000000 --- a/examples/tensorrt_llm_sd/graphs/agg.py +++ /dev/null @@ -1,19 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from components.frontend import Frontend -from components.worker import TensorRTLLMWorker - -Frontend.link(TensorRTLLMWorker) diff --git a/examples/tensorrt_llm_sd/graphs/disagg.py b/examples/tensorrt_llm_sd/graphs/disagg.py deleted file mode 100644 index 58bde05d9a..0000000000 --- a/examples/tensorrt_llm_sd/graphs/disagg.py +++ /dev/null @@ -1,20 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from components.frontend import Frontend -from components.prefill_worker import TensorRTLLMPrefillWorker -from components.worker import TensorRTLLMWorker - -Frontend.link(TensorRTLLMWorker).link(TensorRTLLMPrefillWorker) From 07a822c29bb7b61068bb1e784c8c423798a91684 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Wed, 9 Jul 2025 10:19:43 -0700 Subject: [PATCH 07/20] Standardizing names, Fixing File paths --- .../configs/llama4/eagle/{mtp_agg.yaml => eagle_agg.yaml} | 2 +- .../llama4/eagle/{mtp_disagg.yaml => eagle_disagg.yaml} | 4 ++-- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) rename examples/tensorrt_llm/configs/llama4/eagle/{mtp_agg.yaml => eagle_agg.yaml} (93%) rename examples/tensorrt_llm/configs/llama4/eagle/{mtp_disagg.yaml => eagle_disagg.yaml} (98%) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml similarity index 93% rename from examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml rename to examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index 6a64336101..9ea7e3c265 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/mtp_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -23,7 +23,7 @@ Frontend: TensorRTLLMWorker: served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - extra-engine-args: "configs/llama4/engine_configs/agg_config.yaml" + extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: workers: 1 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml similarity index 98% rename from examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml rename to examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index 72d3ce6f29..a04a9114c2 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/mtp_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -34,7 +34,7 @@ TensorRTLLMWorker: ServiceArgs: workers: 1 resources: - gpu: 4 + gpu: 8 TensorRTLLMPrefillWorker: # NOTE: FP4 only supported starting with Blackwell GPUs. 
@@ -49,4 +49,4 @@ TensorRTLLMPrefillWorker: ServiceArgs: workers: 1 resources: - gpu: 4 + gpu: 8 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 633d630633..053bad7e0f 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -20,14 +20,14 @@ backend: pytorch tensor_parallel_size: 4 -moe_expert_parallel_size: 4 +moe_expert_parallel_size: 1 # enable_attention_dp: true max_batch_size: 256 # 8448 = 8192 ISL + 256 OSL max_num_tokens: 8448 max_seq_len: 8448 kv_cache_config: - free_gpu_memory_fraction: 0.30 + free_gpu_memory_fraction: 0.25 # Enable the MTP(Multi-Token Prediction) in the model engine speculative_config: From 070962d5f6213e17f996ad72bbd7568bf06b765a Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Wed, 9 Jul 2025 13:52:52 -0700 Subject: [PATCH 08/20] Correcting Model Name and Decoding Type --- .../tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml | 6 +++--- .../configs/llama4/eagle/eagle_disagg.yaml | 12 ++++-------- .../llama4/eagle/engine_configs/agg_config.yaml | 2 +- .../llama4/eagle/engine_configs/decode_config.yaml | 2 +- .../llama4/eagle/engine_configs/prefill_config.yaml | 2 +- 5 files changed, 10 insertions(+), 14 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index 9ea7e3c265..bcf97cfa64 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -15,14 +15,14 @@ Frontend: # This is the client-facing model name, you can set this to anything you'd like. - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + model-path: "meta-llama/Llama-4-Maverick-17B-128E" extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index a04a9114c2..429acd1d12 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -14,18 +14,14 @@ # limitations under the License. Frontend: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + model-path: "meta-llama/Llama-4-Maverick-17B-128E" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. 
# The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml" @@ -41,7 +37,7 @@ TensorRTLLMPrefillWorker: # https://huggingface.co/nvidia/DeepSeek-R1-FP4 # You can also specify the full path to locally downloaded weights # instead of a HuggingFace ID here. - model-path: "nvidia/Llama-4-Maverick-17B-128E-Eagle3" + model-path: "meta-llama/Llama-4-Maverick-17B-128E" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml" diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 053bad7e0f..d6528c891c 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -31,7 +31,7 @@ kv_cache_config: # Enable the MTP(Multi-Token Prediction) in the model engine speculative_config: - decoding_type: MTP + decoding_type: Eagle num_nextn_predict_layers: 1 use_cuda_graph: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index fed64bcb22..16bca4c893 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -34,7 +34,7 @@ kv_cache_config: # Enable the MTP(Multi-Token Prediction) in decode model engine speculative_config: - decoding_type: MTP + decoding_type: Eagle num_nextn_predict_layers: 1 use_cuda_graph: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index 6dd4bca5ed..6286876f45 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -33,5 +33,5 @@ disable_overlap_scheduler: true # Enable the MTP(Multi-Token Prediction) in the prefill model engine speculative_config: - decoding_type: MTP + decoding_type: Eagle num_nextn_predict_layers: 1 From d02beeb0764c24fa86b643fb1949cb00fab9ecd4 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Wed, 9 Jul 2025 23:53:12 -0700 Subject: [PATCH 09/20] Adding Speculative Decoding/KV Config Fields --- .../llama4/eagle/engine_configs/agg_config.yaml | 14 +++++++++----- .../llama4/eagle/engine_configs/decode_config.yaml | 13 ++++++++----- .../eagle/engine_configs/prefill_config.yaml | 12 +++++++----- 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index d6528c891c..e3f48e8076 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -21,18 +21,22 @@ backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 1 -# enable_attention_dp: true max_batch_size: 256 # 8448 = 8192 ISL + 256 OSL max_num_tokens: 8448 max_seq_len: 8448 -kv_cache_config: - free_gpu_memory_fraction: 0.25 -# 
Enable the MTP(Multi-Token Prediction) in the model engine +# Enable Speculative Decoding in the model engine speculative_config: decoding_type: Eagle - num_nextn_predict_layers: 1 + max_draft_len: 1 + pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + +kv_cache_config: + free_gpu_memory_fraction: 0.5 + enable_block_reuse: false + +disable_overlap_scheduler: true use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index 16bca4c893..f0dfe4555f 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -21,7 +21,6 @@ backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 -# enable_attention_dp: false max_batch_size: 256 # Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: # max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) @@ -29,13 +28,17 @@ max_batch_size: 256 max_num_tokens: 512 # 8704 = 8192 ISL + 512 OSL max_seq_len: 8704 -kv_cache_config: - free_gpu_memory_fraction: 0.85 +disable_overlap_scheduler: true -# Enable the MTP(Multi-Token Prediction) in decode model engine +# Enable Speculative Decoding in the model engine speculative_config: decoding_type: Eagle - num_nextn_predict_layers: 1 + max_draft_len: 1 + pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + +kv_cache_config: + free_gpu_memory_fraction: 0.5 + enable_block_reuse: false use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index 6286876f45..76e2ee26fc 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -21,17 +21,19 @@ backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 -# enable_attention_dp: true max_batch_size: 1 max_num_tokens: 8192 max_seq_len: 8192 -kv_cache_config: - free_gpu_memory_fraction: 0.75 print_iter_log: true kv_cache_dtype: fp8 disable_overlap_scheduler: true -# Enable the MTP(Multi-Token Prediction) in the prefill model engine +# Enable Speculative Decoding in the model engine speculative_config: decoding_type: Eagle - num_nextn_predict_layers: 1 + max_draft_len: 1 + pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + +kv_cache_config: + free_gpu_memory_fraction: 0.5 + enable_block_reuse: false From f0a570766d504dccad123bbe23c0dab32f133492 Mon Sep 17 00:00:00 2001 From: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Date: Thu, 10 Jul 2025 09:04:21 -0700 Subject: [PATCH 10/20] Adding eagle3_one_model key-value Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> --- .../configs/llama4/eagle/engine_configs/prefill_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index 76e2ee26fc..a356fc0a37 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml 
+++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -33,6 +33,7 @@ speculative_config: decoding_type: Eagle max_draft_len: 1 pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + eagle3_one_model: False kv_cache_config: free_gpu_memory_fraction: 0.5 From 48b7786fc4ab145346a1dc38f04f8d2797438082 Mon Sep 17 00:00:00 2001 From: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Date: Thu, 10 Jul 2025 09:04:32 -0700 Subject: [PATCH 11/20] Adding eagle3_one_model key-value Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> --- .../configs/llama4/eagle/engine_configs/decode_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index f0dfe4555f..bf57dde642 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -35,6 +35,7 @@ speculative_config: decoding_type: Eagle max_draft_len: 1 pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + eagle3_one_model: False kv_cache_config: free_gpu_memory_fraction: 0.5 From 142e0eafc9aa98bd9d110fa0a1a8e3d7dbf718f9 Mon Sep 17 00:00:00 2001 From: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Date: Thu, 10 Jul 2025 09:04:43 -0700 Subject: [PATCH 12/20] Adding eagle3_one_model key-value Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> --- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index e3f48e8076..65af4b7ac3 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -31,6 +31,7 @@ speculative_config: decoding_type: Eagle max_draft_len: 1 pytorch_weights_path: nvidia/Llama-4-Maverick-17B-128E-Eagle3 + eagle3_one_model: False kv_cache_config: free_gpu_memory_fraction: 0.5 From 0985d105c9727b1101aad9c6aebabce03a3b332c Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:45:30 -0700 Subject: [PATCH 13/20] Update Config --- .../configs/llama4/eagle/eagle_agg.yaml | 6 +++--- .../configs/llama4/eagle/eagle_disagg.yaml | 12 ++++------- .../eagle/engine_configs/agg_config.yaml | 21 +++++++------------ .../eagle/engine_configs/decode_config.yaml | 11 ++-------- .../eagle/engine_configs/prefill_config.yaml | 8 ++----- 5 files changed, 19 insertions(+), 39 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index bcf97cfa64..91e1112008 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -15,14 +15,14 @@ Frontend: # This is the client-facing model name, you can set this to anything you'd like. 
- served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" - model-path: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" extra-engine-args: "configs/llama4/eagle/engine_configs/agg_config.yaml" router: round-robin ServiceArgs: diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index 429acd1d12..c255bcf8f5 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -14,14 +14,14 @@ # limitations under the License. Frontend: - served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" endpoint: dynamo.TensorRTLLMWorker.generate port: 8000 router: round-robin TensorRTLLMWorker: - served_model_name: "meta-llama/Llama-4-Maverick-17B-128E" - model-path: "meta-llama/Llama-4-Maverick-17B-128E" + served_model_name: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/decode_config.yaml" @@ -33,11 +33,7 @@ TensorRTLLMWorker: gpu: 8 TensorRTLLMPrefillWorker: - # NOTE: FP4 only supported starting with Blackwell GPUs. - # https://huggingface.co/nvidia/DeepSeek-R1-FP4 - # You can also specify the full path to locally downloaded weights - # instead of a HuggingFace ID here. - model-path: "meta-llama/Llama-4-Maverick-17B-128E" + model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" # Path to a YAML file containing additional keyword arguments to pass to the TRTLLM engine. # The fields in `extra-engine-args` holds higher priority than the above TRTLLM engine fields. extra-engine-args: "configs/llama4/eagle/engine_configs/prefill_config.yaml" diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 65af4b7ac3..83bf12286e 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -13,18 +13,15 @@ # See the License for the specific language governing permissions and # limitations under the License. -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - backend: pytorch -tensor_parallel_size: 4 -moe_expert_parallel_size: 1 +tensor_parallel_size: 8 +moe_expert_parallel_size: 4 max_batch_size: 256 -# 8448 = 8192 ISL + 256 OSL -max_num_tokens: 8448 -max_seq_len: 8448 +# When max_num_tokens set to higher values, can cause OOM issues. +# Will be investigated in the future with TRTLLM team. 
+max_num_tokens: 1024 +max_seq_len: 1024 +autotuner_enabled: false # Enable Speculative Decoding in the model engine speculative_config: @@ -35,9 +32,7 @@ speculative_config: kv_cache_config: free_gpu_memory_fraction: 0.5 - enable_block_reuse: false - -disable_overlap_scheduler: true + enable_block_reuse: false use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml index bf57dde642..4b595d2126 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/decode_config.yaml @@ -13,22 +13,15 @@ # See the License for the specific language governing permissions and # limitations under the License. -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. - backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 max_batch_size: 256 -# Note: When MPT is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula: -# max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1) -# This is a known issue in TensorRT-LLM and will be resolved in the next release. max_num_tokens: 512 # 8704 = 8192 ISL + 512 OSL max_seq_len: 8704 disable_overlap_scheduler: true +autotuner_enabled: false # Enable Speculative Decoding in the model engine speculative_config: @@ -39,7 +32,7 @@ speculative_config: kv_cache_config: free_gpu_memory_fraction: 0.5 - enable_block_reuse: false + enable_block_reuse: false use_cuda_graph: true cuda_graph_padding_enabled: true diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml index a356fc0a37..8442e478ba 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/prefill_config.yaml @@ -13,11 +13,6 @@ # See the License for the specific language governing permissions and # limitations under the License. -# NOTE: FP4 only supported starting with Blackwell GPUs. -# https://huggingface.co/nvidia/DeepSeek-R1-FP4 -# You can also specify the full path to locally downloaded weights -# instead of a HuggingFace ID here. 
- backend: pytorch tensor_parallel_size: 4 moe_expert_parallel_size: 4 @@ -27,6 +22,7 @@ max_seq_len: 8192 print_iter_log: true kv_cache_dtype: fp8 disable_overlap_scheduler: true +autotuner_enabled: false # Enable Speculative Decoding in the model engine speculative_config: @@ -37,4 +33,4 @@ speculative_config: kv_cache_config: free_gpu_memory_fraction: 0.5 - enable_block_reuse: false + enable_block_reuse: false From 8bd09ad7e07df1c7125f782534c3b8b20377c36e Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:50:25 -0700 Subject: [PATCH 14/20] Update GPU/TP Count --- examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml | 2 +- examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml | 4 ++-- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml index 91e1112008..fe4a94df4b 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_agg.yaml @@ -28,4 +28,4 @@ TensorRTLLMWorker: ServiceArgs: workers: 1 resources: - gpu: 8 + gpu: 4 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml index c255bcf8f5..3bfe111fac 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/eagle_disagg.yaml @@ -30,7 +30,7 @@ TensorRTLLMWorker: ServiceArgs: workers: 1 resources: - gpu: 8 + gpu: 4 TensorRTLLMPrefillWorker: model-path: "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8" @@ -41,4 +41,4 @@ TensorRTLLMPrefillWorker: ServiceArgs: workers: 1 resources: - gpu: 8 + gpu: 4 diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 83bf12286e..caa3c9ea3c 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -14,7 +14,7 @@ # limitations under the License. backend: pytorch -tensor_parallel_size: 8 +tensor_parallel_size: 4 moe_expert_parallel_size: 4 max_batch_size: 256 # When max_num_tokens set to higher values, can cause OOM issues. 
From a4ed3f77afe68f6e03cae4089107a830c868f721 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:53:00 -0700 Subject: [PATCH 15/20] Adding back disable_overlap_scheduler key-value --- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index caa3c9ea3c..6cc1305a24 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -22,6 +22,7 @@ max_batch_size: 256 max_num_tokens: 1024 max_seq_len: 1024 autotuner_enabled: false +disable_overlap_scheduler: true # Enable Speculative Decoding in the model engine speculative_config: From d054d530d2f8a4846f91856d1ec500c26a11dc27 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 16:56:02 -0700 Subject: [PATCH 16/20] Updating max_seq_len --- .../configs/llama4/eagle/engine_configs/agg_config.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml index 6cc1305a24..1bed25ef27 100644 --- a/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml +++ b/examples/tensorrt_llm/configs/llama4/eagle/engine_configs/agg_config.yaml @@ -20,7 +20,7 @@ max_batch_size: 256 # When max_num_tokens set to higher values, can cause OOM issues. # Will be investigated in the future with TRTLLM team. max_num_tokens: 1024 -max_seq_len: 1024 +max_seq_len: 8448 autotuner_enabled: false disable_overlap_scheduler: true From c5fc5c8a884130f1fd688d9202938cdf98f59be5 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 18:46:08 -0700 Subject: [PATCH 17/20] Updating README.md --- examples/tensorrt_llm/README.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index f844a56d94..7e8e086a39 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -350,3 +350,25 @@ unset TRTLLM_USE_NIXL_KVCACHE export TRTLLM_USE_UCX_KVCACHE=1 ``` + +### Example architectures for Llama 4 Maverick Instruct + Eagle Speculative Decoding + +#### Notes +* The current example has been tested out on a GB200x4 node. +* To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: + * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) + * It includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) + +##### Aggregated Serving +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml +``` +* Known Issue: In Aggregated Serving, when the `max_num_tokens` was set to higher values, in our case 8448, we experienced Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. 
+ + +##### Disaggregated Serving +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml +``` \ No newline at end of file From 25ba4f150e512b07fbb7763c271c4231aa244661 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 18:48:35 -0700 Subject: [PATCH 18/20] Fix wording --- examples/tensorrt_llm/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index 7e8e086a39..b447eef45f 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -357,7 +357,7 @@ export TRTLLM_USE_UCX_KVCACHE=1 * The current example has been tested out on a GB200x4 node. * To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) - * It includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) + * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) ##### Aggregated Serving ```bash From 1de9595b10948ff9bfb895707b26fb38202398f5 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Fri, 11 Jul 2025 19:01:34 -0700 Subject: [PATCH 19/20] Adding to README.md --- examples/tensorrt_llm/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index b447eef45f..d94f312ca7 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -358,14 +358,14 @@ export TRTLLM_USE_UCX_KVCACHE=1 * To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) +* If you need to download model weights off huggingface, make sure you run the command `huggingface-cli login` and have access to the necessary gated models. ##### Aggregated Serving ```bash cd /workspace/examples/tensorrt_llm dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml ``` -* Known Issue: In Aggregated Serving, when the `max_num_tokens` was set to higher values, in our case 8448, we experienced Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. - +* Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. ##### Disaggregated Serving ```bash From fabab466564e47fd25ed76cd04bd33e1bc034c71 Mon Sep 17 00:00:00 2001 From: Krishnan Prashanth Date: Mon, 14 Jul 2025 10:26:23 -0700 Subject: [PATCH 20/20] Updates to disaggregate workflow --- examples/tensorrt_llm/README.md | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/examples/tensorrt_llm/README.md b/examples/tensorrt_llm/README.md index d94f312ca7..b6169efd54 100644 --- a/examples/tensorrt_llm/README.md +++ b/examples/tensorrt_llm/README.md @@ -354,7 +354,9 @@ export TRTLLM_USE_UCX_KVCACHE=1 ### Example architectures for Llama 4 Maverick Instruct + Eagle Speculative Decoding #### Notes -* The current example has been tested out on a GB200x4 node. 
+* Testing for the current example used: + * One GB200x4 node for aggregate serving + * Two GB200x4 nodes for disaggregate serving * To run Eagle Speculative Decoding with Llama 4, ensure the container meets the following criteria: * Built with a version of TensorRT-LLM based on the 0.21 release [Link](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.21) * The TensorRT-LLM build includes the changes from this PR [Link](https://github.com/NVIDIA/TensorRT-LLM/pull/5975) @@ -368,7 +370,31 @@ dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_agg.yaml * Known Issue: In Aggregated Serving, setting `max_num_tokens` to higher values (e.g. `max_num_tokens: 8448`) can lead to Out of Memory (OOM) errors. This is being investigated by the TRTLLM team. ##### Disaggregated Serving + +###### Head Node +Start nats/etcd +``` bash +nats-server -js & +etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd & +``` + +Launch graph of Frontend and TensorRTLLMWorker (decode) on head node: + +```bash +cd /workspace/examples/tensorrt_llm +dynamo serve graphs.agg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml & +``` + +###### Worker Node(s) +Set environment variables pointing at the etcd/nats endpoints on the head node. +```bash +export HEAD_NODE_IP="" +export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" +export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" +``` + +Deploy a Prefill worker: ```bash cd /workspace/examples/tensorrt_llm -dynamo serve graphs.disagg:Frontend -f configs/llama4/eagle/eagle_disagg.yaml +dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/llama4/eagle/eagle_disagg.yaml --service-name TensorRTLLMPrefillWorker & ``` \ No newline at end of file
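
Once the prefill worker has registered, a quick sanity check from the head node confirms the disaggregated deployment is serving requests. The snippet below is only a sketch: it assumes the frontend is listening on port 8000 as configured above and reuses the served model name from `eagle_disagg.yaml`.

```bash
# List models registered with the frontend; the served model should appear
# once both the decode and prefill workers have started.
curl -s localhost:8000/v1/models

# Send a small test request.
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```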